Re: asynchronous execution

Started by Robert Haas over 9 years ago · 65 messages
#1 Robert Haas
robertmhaas@gmail.com
1 attachment(s)

[ Adjusting subject line to reflect the actual topic of discussion better. ]

On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

For example, in the above plan which you specified, suppose:
1. Hash Join has called ExecProcNode() for the child foreign scan b, and
so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
2. The event wait list already has foreign scan on a that is on a
different subtree.
3. This foreign scan a happens to be ready, so in
ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
which returns with result_ready.
4. Since it returns result_ready, its parent node is now inserted in the
callbacks array, and so its parent (Append) is executed.
5. But this Append planstate is already in the middle of executing Hash
join, and is waiting for HashJoin.

Ah, yeah, something like that could happen. I've spent much of this
week working on a new design for this feature which I think will avoid
this problem. It doesn't work yet - in fact I can't even really test
it yet. But I'll post what I've got by the end of the day today so
that anyone who is interested can look at it and critique.

Well, I promised to post this, so here it is. It's not really working
all that well at this point, and it's definitely not doing anything
that interesting, but you can see the outline of what I have in mind.
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....

Some notes:

- EvalPlanQual rechecks are broken.
- EXPLAIN ANALYZE instrumentation is broken.
- ExecReScanAppend is broken, because the async stuff needs some way
of canceling an async request and I didn't invent anything like that
yet.
- The postgres_fdw changes pretend to be async but aren't actually.
It's just a demo of (part of) the interface at this point.
- The postgres_fdw changes also report all pg-fdw paths as
async-capable, but actually the direct-modify ones aren't, so the
regression tests fail.
- Errors in the executor can leak the WaitEventSet. Probably we need
to modify ResourceOwners to be able to own WaitEventSets.
- There are probably other bugs, too.

Whee!

Note that I've tried to solve the re-entrancy problems by (1) putting
all of the event loop's state inside the EState rather than in local
variables and (2) having the function that is called to report arrival
of a result be thoroughly different than the function that is used to
return a tuple to a synchronous caller.
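To make that concrete, here's a condensed sketch of the requestor-side
pattern (the helper name is invented for illustration; the real logic,
including the synchronous-subplan fallback, is ExecAppend in
nodeAppend.c in the attached patch):

/*
 * Condensed sketch of the requestor-side pattern; hypothetical helper,
 * not part of the patch itself.
 */
static TupleTableSlot *
append_next_async_tuple(AppendState *node, long timeout)
{
    EState *estate = node->ps.state;
    int     i;

    /* Issue a fresh request to every child that needs one. */
    while ((i = bms_first_member(node->as_needrequest)) >= 0)
    {
        ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
        node->as_nasyncpending++;
    }

    /*
     * Run the event loop: a timeout of -1 blocks until a result arrives,
     * 0 merely polls.  Results are delivered through the requestor's
     * ExecAsyncResponse callback (ExecAsyncAppendResponse for Append),
     * which buffers them in as_asyncresult.
     */
    while (node->as_nasyncpending > 0 &&
           ExecAsyncEventLoop(estate, &node->ps, timeout))
    {
        if (node->as_nasyncresult > 0)
            return node->as_asyncresult[--node->as_nasyncresult];
    }

    return NULL;                /* timed out, or nothing was pending */
}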

Comments welcome, if you're feeling brave enough to look at anything
this half-baked.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

async-wip-2016-09-23.patch (binary/octet-stream)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index daf0438..ab69aa3 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -343,6 +344,14 @@ static void postgresGetForeignJoinPaths(PlannerInfo *root,
 							JoinPathExtraData *extra);
 static bool postgresRecheckForeignScan(ForeignScanState *node,
 						   TupleTableSlot *slot);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+								  PendingAsyncRequest *areq,
+								  bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -455,6 +464,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -4342,6 +4357,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = postgresIterateForeignScan(node);
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+								  bool reinit)
+{
+	elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	elog(ERROR, "postgresForeignAsyncNotify");
+}
+
 /*
  * Create a tuple from the specified row of the PGresult.
  *
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.
+This might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest.  Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		i = estate->es_num_pending_async;
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (i >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[i] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[i] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[i];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async++;
+
+	/* Initialize the new request. */
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Check for events, but don't block if there are notifications
+		 * that have not yet been delivered.
+		 */
+		if (estate->es_async_callback_pending > 0)
+			ExecAsyncEventWait(estate, 0);
+		else if (!ExecAsyncEventWait(estate, cur_timeout))
+			cur_timeout = 0;			/* Timeout was reached. */
+		else if (timeout > 0)
+		{
+			/* Events found; recompute the time remaining. */
+			instr_time      cur_time;
+
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout < 0)
+				cur_timeout = 0;
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			/* Skip it if no callback is pending. */
+			if (!areq->callback_pending)
+				continue;
+
+			/*
+			 * Mark it as no longer needing a callback.  We must do this
+			 * before dispatching the callback in case the callback resets
+			 * the flag.
+			 */
+			areq->callback_pending = false;
+			estate->es_async_callback_pending--;
+
+			/* Perform the actual callback; set done flags as appropriate. */
+			if (!areq->request_complete)
+				ExecAsyncNotify(estate, areq);
+			else
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+			}
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all not-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				if (!tail->callback_pending && tail->request_complete)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; on the contrary,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+		estate->es_wait_event_set =
+			CreateWaitEventSet(estate->es_query_cxt,
+							   estate->es_allocated_fd_events + 1);
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0)
+			ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE);
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			if (!areq->callback_pending)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+				estate->es_async_callback_pending++;
+			}
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch && !areq->callback_pending)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+				estate->es_async_callback_pending++;
+			}
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register.  force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->request_complete = true;
+
+	/* Make sure this request is flagged for a callback. */
+	if (!areq->callback_pending)
+	{
+		areq->callback_pending = true;
+		estate->es_async_callback_pending++;
+	}
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, all async subplans need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * If there are any nodes that need a new asynchronous request,
+		 * make all of them.
+		 */
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			node->as_nasyncpending++;
+		}
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			for (;;)
+			{
+				if (node->as_nasyncpending == 0)
+				{
+					/*
+					 * If there is no asynchronous activity still pending
+					 * and the synchronous activity is also complete, we're
+					 * totally done scanning this node.  Otherwise, we're
+					 * done with the asynchronous stuff but must continue
+					 * scanning the synchronous children.
+					 */
+					if (node->as_syncdone)
+						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+					break;
+				}
+				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+				{
+					/* Timeout reached. */
+					break;
+				}
+				if (node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
+		 */
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->request_complete);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* Request is no longer pending. */
+	Assert(node->as_nasyncpending > 0);
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another request here immediately; ExecAppend issues new
+	 * requests from as_needrequest before re-entering the event loop.
+	 */
+	node->as_needrequest = bms_add_member(node->as_needrequest,
+										  areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 71714bc..23b4e18 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ae86954..dc5b938 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 47158f6..e7e55c0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *appendplans, int nasyncplans, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
 	}
 
 	/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4934,7 +4944,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4944,6 +4954,7 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
 
 	return node;
 }
@@ -6218,3 +6229,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
 } FdwRoutine;
 
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4fa3661..e5282b5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	bool		callback_pending;			/* Callback is needed. */
+	bool		request_complete;		/* Request complete, result valid. */
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
 	HeapTuple  *es_epqTuple;	/* array of EPQ substitute tuples */
 	bool	   *es_epqTupleSet; /* true if EPQ tuple is provided */
 	bool	   *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_async_callback_pending is the number of PendingAsyncRequests for
+	 * which callback_pending is true.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;
+	int			es_max_pending_async;
+	int			es_async_callback_pending;
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1141,17 +1185,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
 } Append;
 
 /* ----------------
-- 
2.5.4 (Apple Git-61)
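The producer-side contract described in the executor README above boils
down to three callbacks. As a rough sketch (MyFdwState, its sock field,
and the my_* helpers are hypothetical; only the ExecAsync* entry points
and AddWaitEventToSet come from the patch and existing core code):

/* Sketch of an async-capable FDW, following the README's contract. */

static void
myfdw_ForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
    ForeignScanState *node = (ForeignScanState *) areq->requestee;
    MyFdwState *fstate = (MyFdwState *) node->fdw_state;

    if (my_result_ready(fstate))        /* hypothetical helper */
        ExecAsyncRequestDone(estate, areq, (Node *) my_get_slot(node));
    else
        ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}

static void
myfdw_ForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
                                bool reinit)
{
    ForeignScanState *node = (ForeignScanState *) areq->requestee;
    MyFdwState *fstate = (MyFdwState *) node->fdw_state;

    /* Register our socket; user_data must point back to areq. */
    AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
                      fstate->sock, NULL, areq);
}

static void
myfdw_ForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
    ForeignScanState *node = (ForeignScanState *) areq->requestee;
    MyFdwState *fstate = (MyFdwState *) node->fdw_state;

    /* The socket is readable: consume input, report a tuple if complete. */
    if (my_consume_input(fstate))       /* hypothetical helper */
        ExecAsyncRequestDone(estate, areq, (Node *) my_get_slot(node));
}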

#2 Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#1)

On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:

Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....

I see that the reason you re-designed the asynchronous execution
implementation is that the earlier implementation showed
performance degradation in local sequential and local parallel scans.
But I checked that the ExecProcNode() changes were not so
significant as to cause the degradation. It will not call
ExecAsyncWaitForNode() unless that node supports asynchronism. Do you
feel there is anywhere else in the implementation that is really
causing this degradation? That previous implementation has some issues,
but they seemed solvable. We could resolve the plan state recursion
issue by explicitly making sure the same plan state does not get
called again while it is already executing.
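(A minimal sketch of how such a re-entrancy guard might look; the
ps_executing flag and the wrapper name are hypothetical, not from any
posted patch:)

static TupleTableSlot *
ExecProcNodeGuarded(PlanState *node)
{
    TupleTableSlot *slot;

    if (node->ps_executing)     /* hypothetical flag in PlanState */
        return NULL;            /* node already on the stack; don't recurse */

    node->ps_executing = true;
    slot = ExecProcNode(node);
    node->ps_executing = false;

    return slot;
}

Real code would also need to reset the flag on error exit, but the idea
is just that the dispatcher skips a node that is already executing.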

Thanks
-Amit Khandekar


#3 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Robert Haas (#1)

Sorry for the delayed response; I'll have enough time from now on and
will address this.

At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com>

Well, I promised to post this, so here it is. It's not really working
all that well at this point, and it's definitely not doing anything
that interesting, but you can see the outline of what I have in mind.
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....

Some notes:

- EvalPlanQual rechecks are broken.
- EXPLAIN ANALYZE instrumentation is broken.
- ExecReScanAppend is broken, because the async stuff needs some way
of canceling an async request and I didn't invent anything like that
yet.
- The postgres_fdw changes pretend to be async but aren't actually.
It's just a demo of (part of) the interface at this point.
- The postgres_fdw changes also report all pg-fdw paths as
async-capable, but actually the direct-modify ones aren't, so the
regression tests fail.
- Errors in the executor can leak the WaitEventSet. Probably we need
to modify ResourceOwners to be able to own WaitEventSets.
- There are probably other bugs, too.

Whee!

Note that I've tried to solve the re-entrancy problems by (1) putting
all of the event loop's state inside the EState rather than in local
variables and (2) having the function that is called to report arrival
of a result be thoroughly different than the function that is used to
return a tuple to a synchronous caller.

Comments welcome, if you're feeling brave enough to look at anything
this half-baked.

--
Kyotaro Horiguchi
NTT Open Source Software Center


#4 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Amit Khandekar (#2)

Hello, thank you for the comment.

At Wed, 28 Sep 2016 10:00:08 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in <CAJ3gD9fRmEhUoBMnNN8K_QrHZf7m4rmOHTFDj492oeLZff8o=w@mail.gmail.com>

On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:

Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....

I see that the reason why you re-designed the asynchronous execution
implementation is because the earlier implementation showed
performance degradation in local sequential and local parallel scans.
But I checked that the ExecProcNode() changes were not that
significant as to cause the degradation.

The basic thought is that we cannot accept a degradation of even
around one percent for simple cases in exchange for this
feature (or similar ones).

A very simple case such as SeqScan runs through a very short code
path, where the CPU branch-misprediction penalty of even a few added
branches results in a visible impact. I avoided that by using
likely/unlikely, but a more fundamental measure is preferable; see
the sketch below.
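(For illustration, the kind of branch hint meant here looks like the
following; the async_capable flag and the wrapper are hypothetical,
and likely/unlikely are the usual __builtin_expect macros:)

#ifndef unlikely
#define unlikely(x) __builtin_expect((x) != 0, 0)
#endif

/* Illustrative only: keep the common synchronous path cheap. */
static inline TupleTableSlot *
exec_proc_node_hinted(PlanState *node)
{
    if (unlikely(node->async_capable))      /* hypothetical flag */
        return ExecAsyncWaitForNode(node);  /* rare asynchronous path */

    return ExecProcNode(node);              /* hot synchronous path */
}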

It will not call ExecAsyncWaitForNode() unless that node
supports asynchronism.

That's true, but it takes a certain number of CPU cycles to decide
whether to call it or not. That small bit of time is the issue in
focus now.

Do you feel there is anywhere else in
the implementation that is really causing this degradation? That
previous implementation has some issues, but they seemed
solvable. We could resolve the plan state recursion issue by
explicitly making sure the same plan state does not get called
again while it is already executing.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#5 Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Robert Haas (#1)

Thank you for the thought.

At Fri, 23 Sep 2016 21:09:03 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com>

[ Adjusting subject line to reflect the actual topic of discussion better. ]

On Fri, Sep 23, 2016 at 9:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

For example, in the above plan which you specified, suppose:
1. Hash Join has called ExecProcNode() for the child foreign scan b, and
so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
2. The event wait list already has foreign scan on a that is on a
different subtree.
3. This foreign scan a happens to be ready, so in
ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is called,
which returns with result_ready.
4. Since it returns result_ready, its parent node is now inserted in the
callbacks array, and so its parent (Append) is executed.
5. But this Append planstate is already in the middle of executing Hash
join, and is waiting for HashJoin.

Ah, yeah, something like that could happen. I've spent much of this
week working on a new design for this feature which I think will avoid
this problem. It doesn't work yet - in fact I can't even really test
it yet. But I'll post what I've got by the end of the day today so
that anyone who is interested can look at it and critique.

Well, I promised to post this, so here it is. It's not really working
all that well at this point, and it's definitely not doing anything
that interesting, but you can see the outline of what I have in mind.
Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....

The previous framework didn't need to distinguish async-capable from
non-capable nodes from the parent node's point of view; the
ExecProcNode changes were required for that reason. This new one
instead removes the ExecProcNode changes by distinguishing the two
kinds of node in async-aware parents, that is, Append. Async-unaware
nodes are no longer involved in the tuple bubbling-up mechanism, so
the re-entrancy problem doesn't seem to occur.

On the other hand, consider for example the following plan,
regardless of its practicality (there are probably better examples):

(Async-unaware node)
 - NestLoop
   - Append
     - n * ForeignScan
   - Append
     - n * ForeignScan

If the NestLoop and Appends are async-aware, all of the ForeignScans
can run asynchronously with the previous framework. The topmost
NestLoop will be awakened once the firing of any ForeignScan makes a
tuple bubble up to it. This is thanks to the property, provided by
the ExecProcNode changes, that a parent does not need to distinguish
async-aware children from unaware ones.

With the new one, in order to do the same thing, ExecAppend would in
turn have to behave differently depending on whether its parent is
async-aware or not. Doing that looks bothersome, though I am not
confident in that judgment.

I will examine this further intensively, especially regarding
performance degradation and obstacles to completing this.

Some notes:

- EvalPlanQual rechecks are broken.
- EXPLAIN ANALYZE instrumentation is broken.
- ExecReScanAppend is broken, because the async stuff needs some way
of canceling an async request and I didn't invent anything like that
yet.
- The postgres_fdw changes pretend to be async but aren't actually.
It's just a demo of (part of) the interface at this point.
- The postgres_fdw changes also report all pg-fdw paths as
async-capable, but actually the direct-modify ones aren't, so the
regression tests fail.
- Errors in the executor can leak the WaitEventSet. Probably we need
to modify ResourceOwners to be able to own WaitEventSets.
- There are probably other bugs, too.

Whee!

Note that I've tried to solve the re-entrancy problems by (1) putting
all of the event loop's state inside the EState rather than in local
variables and (2) having the function that is called to report arrival
of a result be thoroughly different than the function that is used to
return a tuple to a synchronous caller.

Comments welcome, if you're feeling brave enough to look at anything
this half-baked.

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#2)

On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:

Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....

I see that you re-designed the asynchronous execution implementation
because the earlier implementation showed performance degradation in
local sequential and local parallel scans. But I checked that the
ExecProcNode() changes were not significant enough to cause that
degradation.

I think we need some testing to prove that one way or the other. If
you can do some - say on a plan with multiple nested loop joins with
inner index-scans, which will call ExecProcNode() a lot - that would
be great. I don't think we can just rely on "it doesn't seem like it
should be slower", though - ExecProcNode() is too important a function
for us to guess at what the performance will be.

The thing I'm really worried about with either implementation is what
happens when we start to add asynchronous capability to multiple
nodes. For example, if you imagine a plan like this:

Append
-> Hash Join
-> Foreign Scan
-> Hash
-> Seq Scan
-> Hash Join
-> Foreign Scan
-> Hash
-> Seq Scan

In order for this to run asynchronously, you need not only Append and
Foreign Scan to be async-capable, but also Hash Join. That's true in
either approach. Things are slightly better with the original
approach, but the basic problem is there in both cases. So it seems
we need an approach that will make adding async capability to a node
really cheap, which seems like it might be a problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#6)

On 4 October 2016 at 02:30, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Sep 28, 2016 at 12:30 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 24 September 2016 at 06:39, Robert Haas <robertmhaas@gmail.com> wrote:

Since Kyotaro Horiguchi found that my previous design had a
system-wide performance impact due to the ExecProcNode changes, I
decided to take a different approach here: I created an async
infrastructure where both the requestor and the requestee have to be
specifically modified to support parallelism, and then modified Append
and ForeignScan to cooperate using the new interface. Hopefully that
means that anything other than those two nodes will suffer no
performance impact. Of course, it might have other problems....

I see that you re-designed the asynchronous execution implementation
because the earlier implementation showed performance degradation in
local sequential and local parallel scans. But I checked that the
ExecProcNode() changes were not significant enough to cause that
degradation.

I think we need some testing to prove that one way or the other. If
you can do some - say on a plan with multiple nested loop joins with
inner index-scans, which will call ExecProcNode() a lot - that would
be great. I don't think we can just rely on "it doesn't seem like it
should be slower"

Agreed. I will come up with some tests.

, though - ExecProcNode() is too important a function
for us to guess at what the performance will be.

Also, parent pointers are not required in the new design. Thinking
about parent pointers, it now seems the event won't get bubbled up the
tree with the new design. Still, I think it's possible to switch over
to the other asynchronous tree when some node in the current subtree
is waiting. But I am not sure; I will think more on that.

The thing I'm really worried about with either implementation is what
happens when we start to add asynchronous capability to multiple
nodes. For example, if you imagine a plan like this:

Append
-> Hash Join
-> Foreign Scan
-> Hash
-> Seq Scan
-> Hash Join
-> Foreign Scan
-> Hash
-> Seq Scan

In order for this to run asynchronously, you need not only Append and
Foreign Scan to be async-capable, but also Hash Join. That's true in
either approach. Things are slightly better with the original
approach, but the basic problem is there in both cases. So it seems
we need an approach that will make adding async capability to a node
really cheap, which seems like it might be a problem.

Yes, we might have to deal with this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#7)

On Tue, Oct 4, 2016 at 7:53 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Also, parent pointers are not required in the new design. Thinking
about parent pointers, it now seems the event won't get bubbled up the
tree with the new design. Still, I think it's possible to switch over
to the other asynchronous tree when some node in the current subtree
is waiting. But I am not sure; I will think more on that.

The bubbling-up still happens, because each node that made an async
request gets a callback with the final response - and if it is itself
the recipient of an async request, it can use that callback to respond
to that request in turn. This version doesn't bubble up through
non-async-aware nodes, but that might be a good thing.
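
To make that concrete, the response callback of a hypothetical
async-aware intermediate node could relay the result like this (a
sketch only, not from any posted patch; SomeJoinState and its
parent_areq field are invented here for illustration):

/* Hypothetical state for an async-aware intermediate node. */
typedef struct SomeJoinState
{
	PlanState	ps;
	PendingAsyncRequest *parent_areq;	/* request we must answer, if any */
} SomeJoinState;

static void
ExecAsyncSomeJoinResponse(EState *estate, PendingAsyncRequest *areq)
{
	SomeJoinState *node = (SomeJoinState *) areq->requestor;
	Node	   *result = areq->result;

	/* ... node-specific processing of the child's result here ... */

	/*
	 * If this node is itself the requestee of an async request, answer
	 * it in turn; that is how a result bubbles up, but only through
	 * async-aware nodes.
	 */
	if (node->parent_areq != NULL)
		ExecAsyncRequestDone(estate, node->parent_areq, result);
}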

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#5)
7 attachment(s)

Hello. This works, but ExecAppend suffers a bit of performance degradation.

At Mon, 03 Oct 2016 19:46:32 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161003.194632.204401048.horiguchi.kyotaro@lab.ntt.co.jp>

Some notes:

- EvalPlanQual rechecks are broken.

This is fixed by adding (restoring) async-cancelation.

- EXPLAIN ANALYZE instrumentation is broken.

EXPLAIN ANALYZE seems to work, but async-specific information is
not available yet.

- ExecReScanAppend is broken, because the async stuff needs some way
of canceling an async request and I didn't invent anything like that
yet.

Fixed in the same way as EvalPlanQual.

- The postgres_fdw changes pretend to be async but aren't actually.
It's just a demo of (part of) the interface at this point.

Applied my previous patch with some modifications.

- The postgres_fdw changes also report all pg-fdw paths as
async-capable, but actually the direct-modify ones aren't, so the
regression tests fail.

All actions other than scans now do vacate_connection() so that they
can use the connection.

- Errors in the executor can leak the WaitEventSet. Probably we need
to modify ResourceOwners to be able to own WaitEventSets.

The WaitEventSet itself is not leaked, but its epoll fd should be
closed on failure. This seems doable by TRY-CATCHing in
ExecAsyncEventLoop. (Not done yet.)
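
Perhaps something along these lines inside ExecAsyncEventLoop (a
minimal sketch, not in the attached patches; FreeWaitEventSet closes
the underlying epoll fd):

PG_TRY();
{
	timed_out = !ExecAsyncEventWait(estate, cur_timeout);
}
PG_CATCH();
{
	/* Close the epoll fd held by the WaitEventSet before re-throwing. */
	if (estate->es_wait_event_set != NULL)
	{
		FreeWaitEventSet(estate->es_wait_event_set);
		estate->es_wait_event_set = NULL;
	}
	PG_RE_THROW();
}
PG_END_TRY();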

- There are probably other bugs, too.

Whee!

Note that I've tried to solve the re-entrancy problems by (1) putting
all of the event loop's state inside the EState rather than in local
variables and (2) having the function that is called to report arrival
of a result be thoroughly different than the function that is used to
return a tuple to a synchronous caller.

Comments welcome, if you're feeling brave enough to look at anything
this half-baked.

This doesn't cause reentry, since tuples no longer bubble up through
async-unaware nodes; the framework passes tuples through private
channels between requestor and requestees.
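
The "private channel" is the PendingAsyncRequest itself, which ties
one requestor to one requestee and carries the result back (this
struct is from the execnodes.h hunk of 0001):

typedef struct PendingAsyncRequest
{
	int			myindex;			/* Index in es_pending_async. */
	struct PlanState *requestor;	/* Node that wants a tuple. */
	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
	int			request_index;	/* Scratch space for requestor. */
	int			num_fd_events;	/* Max number of FD events requestee needs. */
	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
	bool		callback_pending;	/* Callback is needed. */
	bool		request_complete;	/* Request complete, result valid. */
	Node	   *result;			/* Result (NULL if no more tuples). */
} PendingAsyncRequest;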

Anyway, I amended this, made postgres_fdw async, and finally got all
regression tests to pass with minor modifications. The attached
patches are as follows.

0001-robert-s-2nd-framework.patch
The patch Robert posted upthread.

0002-Fix-some-bugs.patch
A small patch to fix compilation errors in 0001.

0003-Modify-async-execution-infrastructure.patch
Several modifications to the infrastructure. The details are
given after the measurements below.

0004-Make-postgres_fdw-async-capable.patch
A truly asynchronous postgres_fdw.

gentblr.sql, testrun.sh, calc.pl
Performance test script suite.

gentblr.sql - creates the test tables.
testrun.sh - performs a single test run.
calc.pl - drives testrun.sh and summarizes its results.

I measured performance and got the following results.

t0 - SELECT sum(a) FROM <local single table>;
pl - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time<ms> (std dev <ms>)"

sync
t0: 3820.33 ( 1.88)
pl: 1608.59 ( 12.06)
pf0: 7928.29 ( 46.58)
pf1: 8023.16 ( 26.43)

async
t0: 3806.31 ( 4.49) 0.4% faster (within measurement error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster

t0 is not affected, since the ExecProcNode changes are gone.

pl gets a bit slower (almost the same as the simple seqscan case with
the previous patch). This is presumably a branch-misprediction penalty.

pf0, pf1 are faster as expected.

========

Below is a summary of the modifications made by the 0002 and 0003 patches.

execAsync.c, execnodes.h:

- Added include "pgstat.h" to use WAIT_EVENT_ASYNC_WAIT.

- Changed the interface of ExecAsyncRequest to return whether a tuple
is immediately available or not.

- Made ExecAsyncConfigureWait return whether it registered at least
one wait event or not. The caller (ExecAsyncEventWait) uses this to
make sure it has at least one event to wait on (for safety).

If two or more postgres_fdw nodes share one connection, only one of
them can be waited on at a time. It is the FDW driver's
responsibility to ensure that at least one wait event is added;
otherwise WaitEventSetWait silently waits forever.

- There were separate areq->callback_pending and
areq->request_complete flags, but they always change together, so they
are replaced with a single state variable areq->state. A new enum
AsyncRequestState for areq->state is added in execnodes.h (see the
sketch after this list).

nodeAppend.c:

- Return a tuple immediately if ExecAsyncRequest says that a
tuple is available.

- Reduced the nesting level of the for(;;) loop.

nodeForeignscan.[ch], fdwapi.h, execProcnode.c:

- Calling postgresIterateForeignScan directly can yield tuples of the
wrong shape; call ExecForeignScan instead.

- Changed the interface of AsyncConfigureWait to match the execAsync.c
change above.

- Added ShutdownForeignScan interface.

createplan.c, ruleutils.c, plannodes.h:

- With Robert's change, EXPLAIN shows somewhat odd plans in which the
Output of an Append is named after a non-parent child. This does no
harm but is unsettling. Added the index of the parent in
Append.referent to make the output reasonable (but this looks ugly..).
The children shown by EXPLAIN are still in a different order from the
definition. (expected/postgres_fdw.out is edited accordingly.)
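
For reference, the two flags collapse into a single state variable
along these lines (a sketch of the 0003 change; the exact member names
below are my guesses, not quoted from the patch):

typedef enum AsyncRequestState
{
	ASYNCREQ_IDLE,				/* no request outstanding */
	ASYNCREQ_WAITING,			/* request made, no result yet */
	ASYNCREQ_CALLBACK_PENDING,	/* event arrived, callback not yet run */
	ASYNCREQ_COMPLETE			/* result valid, ready for the requestor */
} AsyncRequestState;

/* In PendingAsyncRequest, replacing the two bool flags: */
AsyncRequestState state;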

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-robert-s-2nd-framework.patchtext/x-patch; charset=us-asciiDownload
From 1af1d3ca952e6a241852d7b9b27be50915c8b0cc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/4] robert's 2nd framework

---
 contrib/postgres_fdw/postgres_fdw.c     |  49 ++++
 src/backend/executor/Makefile           |   4 +-
 src/backend/executor/README             |  43 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 462 ++++++++++++++++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 162 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 ++++
 src/backend/nodes/copyfuncs.c           |   1 +
 src/backend/nodes/outfuncs.c            |   1 +
 src/backend/nodes/readfuncs.c           |   1 +
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/include/executor/execAsync.h        |  29 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  15 ++
 src/include/nodes/execnodes.h           |  57 +++-
 src/include/nodes/plannodes.h           |   1 +
 17 files changed, 909 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index daf0438..ab69aa3 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -343,6 +344,14 @@ static void postgresGetForeignJoinPaths(PlannerInfo *root,
 							JoinPathExtraData *extra);
 static bool postgresRecheckForeignScan(ForeignScanState *node,
 						   TupleTableSlot *slot);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+								  PendingAsyncRequest *areq,
+								  bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -455,6 +464,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -4342,6 +4357,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = postgresIterateForeignScan(node);
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+								  bool reinit)
+{
+	elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	elog(ERROR, "postgresForeignAsyncNotify");
+}
+
 /*
  * Create a tuple from the specified row of the PGresult.
  *
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest.  Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		i = estate->es_num_pending_async;
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (i >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[i] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[i] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[i];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async++;
+
+	/* Initialize the new request. */
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Check for events, but don't block if there are notifications that
+		 * have not been delivered yet.
+		 */
+		if (estate->es_async_callback_pending > 0)
+			ExecAsyncEventWait(estate, 0);
+		else if (!ExecAsyncEventWait(estate, cur_timeout))
+			cur_timeout = 0;			/* Timeout was reached. */
+		else if (timeout > 0)
+		{
+			/* Reduce the remaining timeout by the time already elapsed. */
+			instr_time      cur_time;
+
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout < 0)
+				cur_timeout = 0;
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			/* Skip it if no callback is pending. */
+			if (!areq->callback_pending)
+				continue;
+
+			/*
+			 * Mark it as no longer needing a callback.  We must do this
+			 * before dispatching the callback in case the callback resets
+			 * the flag.
+			 */
+			areq->callback_pending = false;
+			estate->es_async_callback_pending--;
+
+			/* Perform the actual callback; set requestor_done if appropriate. */
+			if (!areq->request_complete)
+				ExecAsyncNotify(estate, areq);
+			else
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+			}
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all non-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				if (!tail->callback_pending && tail->request_complete)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; to the contrary,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+		estate->es_wait_event_set =
+			CreateWaitEventSet(estate->es_query_cxt,
+							   estate->es_allocated_fd_events + 1);
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0)
+			ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE);
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			if (!areq->callback_pending)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+				estate->es_async_callback_pending++;
+			}
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+			}
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register.  force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->request_complete = true;
+
+	/* Make sure this request is flagged for a callback. */
+	if (!areq->callback_pending)
+	{
+		areq->callback_pending = true;
+		estate->es_async_callback_pending++;
+	}
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, all async subplans need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * If there are any nodes that need a new asynchronous request,
+		 * make all of them.
+		 */
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			node->as_nasyncpending++;
+		}
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			for (;;)
+			{
+				if (node->as_nasyncpending == 0)
+				{
+					/*
+					 * If there is no asynchronous activity still pending
+					 * and the synchronous activity is also complete, we're
+					 * totally done scanning this node.  Otherwise, we're
+					 * done with the asynchronous stuff but must continue
+					 * scanning the synchronous children.
+					 */
+					if (node->as_syncdone)
+						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+					break;
+				}
+				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+				{
+					/* Timeout reached. */
+					break;
+				}
+				if (node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
+		 */
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->request_complete);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* Request is no longer pending. */
+	Assert(node->as_nasyncpending > 0);
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately because it might complete
+	 */
+	node->as_needrequest = bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 71714bc..23b4e18 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ae86954..dc5b938 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 47158f6..e7e55c0 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
 	}
 
 	/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4934,7 +4944,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4944,6 +4954,7 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
 
 	return node;
 }
@@ -6218,3 +6229,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
 } FdwRoutine;
 
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..b50b41c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	bool		callback_pending;			/* Callback is needed. */
+	bool		request_complete;		/* Request complete, result valid. */
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
 	HeapTuple  *es_epqTuple;	/* array of EPQ substitute tuples */
 	bool	   *es_epqTupleSet; /* true if EPQ tuple is provided */
 	bool	   *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_aync is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_async_callback_pending is the number of PendingAsyncRequests for
+	 * which callback_pending is true.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;
+	int			es_max_pending_async;
+	int			es_async_callback_pending;
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1147,17 +1191,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
 } Append;
 
 /* ----------------
-- 
2.9.2

0002-Fix-some-bugs.patchtext/x-patch; charset=us-asciiDownload
From 2879fc2643e0916431def8a281ac9eb3c58794ee Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/4] Fix some bugs.

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
 contrib/postgres_fdw/postgres_fdw.c            |   3 +-
 src/backend/executor/execAsync.c               |   4 +-
 src/backend/postmaster/pgstat.c                |   3 +
 src/include/pgstat.h                           |   3 +-
 5 files changed, 81 insertions(+), 74 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d97e694..6677bc4 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5082,12 +5082,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- a        | aaa
- a        | aaaa
- a        | aaaaa
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | aaaa
+ a        | aaaaa
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -5110,12 +5110,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -5138,12 +5138,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | new
  b        | new
  b        | new
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -5166,12 +5166,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | newtoo
- a        | newtoo
- a        | newtoo
  b        | newtoo
  b        | newtoo
  b        | newtoo
+ a        | newtoo
+ a        | newtoo
+ a        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -5230,120 +5230,120 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                         QUERY PLAN                                          
----------------------------------------------------------------------------------------------
+                                               QUERY PLAN                                                
+---------------------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar.f1 = foo2.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -5371,26 +5371,26 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
-         Hash Cond: (foo.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+         Hash Cond: (foo2.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
-         Merge Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+         Merge Cond: (bar2.f1 = foo2.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -5398,19 +5398,19 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo.f1)), foo.f1
-               Sort Key: foo.f1
+               Output: (ROW(foo2.f1)), foo2.f1
+               Sort Key: foo2.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -5577,8 +5577,8 @@ update bar set f2 = f2 + 100 returning *;
 update bar set f2 = f2 + 100 returning *;
  f1 | f2  
 ----+-----
-  1 | 311
   2 | 322
+  1 | 311
   6 | 266
   3 | 333
   4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ab69aa3..6da5843 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -4374,7 +4375,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
-	slot = postgresIterateForeignScan(node);
+	slot = ExecForeignScan(node);
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
 #include "executor/nodeAppend.h"
 #include "executor/nodeForeignscan.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
-								 occurred_event, EVENT_BUFFER_SIZE);
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
 	if (noccurred == 0)
 		return false;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5112d6d..558bb8f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3393,6 +3393,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1c9bf13..40c6d08 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

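Unrelated to the bug fixes proper, but worth noting: the pgstat hunk above follows the usual two-step convention for adding a wait event, namely a new enum member plus a matching case in the name-lookup switch, with no default case so that the compiler warns if any member is ever left out. A tiny self-contained illustration of that convention (hypothetical event names, not pgstat's actual ones):

#include <stdio.h>

typedef enum DemoWaitEvent
{
	WE_MQ_SEND,
	WE_SYNC_REP,
	WE_ASYNC_WAIT
} DemoWaitEvent;

static const char *
demo_wait_event_name(DemoWaitEvent w)
{
	switch (w)
	{
		case WE_MQ_SEND:
			return "MessageQueueSend";
		case WE_SYNC_REP:
			return "SyncRep";
		case WE_ASYNC_WAIT:
			return "AsyncExecWait";
			/* no default case, so the compiler warns on missing members */
	}
	return "unknown";
}

int
main(void)
{
	printf("%s\n", demo_wait_event_name(WE_ASYNC_WAIT));
	return 0;
}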
0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From b21c0792ae9efb5e0c3db787b6be118ea5ff9938 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/4] Modify async execution infrastructure.

---
 contrib/postgres_fdw/expected/postgres_fdw.out |  68 ++++++++--------
 contrib/postgres_fdw/postgres_fdw.c            |   5 +-
 src/backend/executor/execAsync.c               | 105 ++++++++++++++-----------
 src/backend/executor/nodeAppend.c              |  50 ++++++------
 src/backend/executor/nodeForeignscan.c         |   4 +-
 src/backend/nodes/copyfuncs.c                  |   1 +
 src/backend/nodes/outfuncs.c                   |   1 +
 src/backend/nodes/readfuncs.c                  |   1 +
 src/backend/optimizer/plan/createplan.c        |  24 +++++-
 src/backend/utils/adt/ruleutils.c              |   6 +-
 src/include/executor/nodeForeignscan.h         |   2 +-
 src/include/foreign/fdwapi.h                   |   2 +-
 src/include/nodes/execnodes.h                  |  10 ++-
 src/include/nodes/plannodes.h                  |   1 +
 14 files changed, 167 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 6677bc4..d429790 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5230,13 +5230,13 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -5244,10 +5244,10 @@ select * from bar where f1 in (select f1 from foo) for update;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -5267,13 +5267,13 @@ select * from bar where f1 in (select f1 from foo) for update;
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -5281,10 +5281,10 @@ select * from bar where f1 in (select f1 from foo) for share;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -5305,22 +5305,22 @@ select * from bar where f1 in (select f1 from foo) for share;
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                               QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
+                                         QUERY PLAN                                          
+---------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar.f1 = foo2.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -5328,16 +5328,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar2.f1 = foo.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -5371,8 +5371,8 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
-         Hash Cond: (foo2.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+         Hash Cond: (foo.f1 = bar.f1)
          ->  Append
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
@@ -5389,8 +5389,8 @@ where bar.f1 = ss.f1;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
-         Merge Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+         Merge Cond: (bar2.f1 = foo.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -5398,8 +5398,8 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo2.f1)), foo2.f1
-               Sort Key: foo2.f1
+               Output: (ROW(foo.f1)), foo.f1
+               Sort Key: foo.f1
                ->  Append
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 6da5843..997bd6c 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -348,7 +348,7 @@ static bool postgresRecheckForeignScan(ForeignScanState *node,
 static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
 								  PendingAsyncRequest *areq,
 								  bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
@@ -4379,11 +4379,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
-static void
+static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 								  bool reinit)
 {
 	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	return false;
 }
 
 static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 	bool reinit);
 static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
 static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 				 PlanState *requestee)
 {
 	PendingAsyncRequest *areq = NULL;
-	int		i = estate->es_num_pending_async;
+	int		nasync = estate->es_num_pending_async;
 
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * We start with 16 slots, and thereafter double the array size each
 	 * time we run out of slots.
 	 */
-	if (i >= estate->es_max_pending_async)
+	if (nasync >= estate->es_max_pending_async)
 	{
 		int	newmax;
 
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
 	 * one.
 	 */
-	if (estate->es_pending_async[i] == NULL)
+	if (estate->es_pending_async[nasync] == NULL)
 	{
 		areq = MemoryContextAllocZero(estate->es_query_cxt,
 									  sizeof(PendingAsyncRequest));
-		estate->es_pending_async[i] = areq;
+		estate->es_pending_async[nasync] = areq;
 	}
 	else
 	{
-		areq = estate->es_pending_async[i];
+		areq = estate->es_pending_async[nasync];
 		MemSet(areq, 0, sizeof(PendingAsyncRequest));
 	}
-	areq->myindex = estate->es_num_pending_async++;
+	areq->myindex = estate->es_num_pending_async;
 
 	/* Initialize the new request. */
 	areq->requestor = requestor;
 	areq->request_index = request_index;
 	areq->requestee = requestee;
 
-	/* Give the requestee a chance to do whatever it wants. */
+	/*
+	 * Give the requestee a chance to do whatever it wants.
+	 * Request functions mark the request ASYNC_COMPLETE if a result is ready.
+	 */
 	switch (nodeTag(requestee))
 	{
 		case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 			elog(ERROR, "unrecognized node type: %d",
 				(int) nodeTag(requestee));
 	}
+
+	/*
+	 * If a result is available, complete it immediately.
+	 */
+	if (areq->state == ASYNC_COMPLETE)
+	{
+		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+		ExecAsyncResponse(estate, areq);
+
+		return;
+	}
+
+	/* No result available now; make this node pending. */
+	estate->es_num_pending_async++;
 }
 
 /*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
-			/* Skip it if no callback is pending. */
-			if (!areq->callback_pending)
-				continue;
-
-			/*
-			 * Mark it as no longer needing a callback.  We must do this
-			 * before dispatching the callback in case the callback resets
-			 * the flag.
-			 */
-			areq->callback_pending = false;
-			estate->es_async_callback_pending--;
-
-			/* Perform the actual callback; set request_done if appropraite. */
-			if (!areq->request_complete)
+			/* Dispatch the callback if one is pending. */
+			if (areq->state == ASYNC_CALLBACK_PENDING)
+			{
+				/*
+				 * Mark it as no longer needing a callback.  We must do this
+				 * before dispatching the callback in case the callback resets
+				 * the flag.
+				 */
+				estate->es_async_callback_pending--;
 				ExecAsyncNotify(estate, areq);
-			else
+			}
+
+			if (areq->state == ASYNC_COMPLETE)
 			{
 				any_node_done = true;
 				if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				PendingAsyncRequest *head;
 				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
 
-				if (!tail->callback_pending && tail->request_complete)
+				if (tail->state == ASYNC_COMPLETE)
 					continue;
 				head = estate->es_pending_async[hidx];
 				estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
  * means wait forever, 0 means don't wait at all, and >0 means wait for the
  * indicated number of milliseconds.
  *
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events, and false if we timed out or there
+ * was no event to wait for (e.g. the request completed during setup).
  */
 static bool
 ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
 	int		n;
 	bool	reinit = false;
 	bool	process_latch_set = false;
+	bool	added = false;
 
 	if (estate->es_wait_event_set == NULL)
 	{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		PendingAsyncRequest *areq = estate->es_pending_async[i];
 
 		if (areq->num_fd_events > 0)
-			ExecAsyncConfigureWait(estate, areq, reinit);
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
 	}
 
+	Assert(added);
+
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
 								 occurred_event, EVENT_BUFFER_SIZE,
 								 WAIT_EVENT_ASYNC_WAIT);
+
 	if (noccurred == 0)
 		return false;
 
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		{
 			PendingAsyncRequest *areq = w->user_data;
 
-			if (!areq->callback_pending)
-			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
-				estate->es_async_callback_pending++;
-			}
+			Assert(areq->state == ASYNC_WAITING);
+
+			areq->state = ASYNC_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
 		}
 	}
 
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 			if (areq->wants_process_latch)
 			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
+				Assert(areq->state == ASYNC_WAITING);
+				areq->state = ASYNC_CALLBACK_PENDING;
 			}
 		}
 	}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
  * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
  * and the number of calls should not exceed areq->num_fd_events (as
  * prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the node
+ * driver's responsibility to register at least one event for each
+ * requestor.
  */
-static void
+static bool
 ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 					   bool reinit)
 {
 	switch (nodeTag(areq->requestee))
 	{
 		case T_ForeignScanState:
-			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
 			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
 	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
 	areq->num_fd_events = num_fd_events;
 	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNC_WAITING;
 
 	if (force_reset && estate->es_wait_event_set != NULL)
 	{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
 	 * need a callback to remove registered wait events.  It's not clear
 	 * that we would come out ahead, so use brute force for now.
 	 */
+	Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
 	if (areq->num_fd_events > 0 || areq->wants_process_latch)
 		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
 
 	/* Save result and mark request as complete. */
 	areq->result = result;
-	areq->request_complete = true;
-
-	/* Make sure this request is flagged for a callback. */
-	if (!areq->callback_pending)
-	{
-		areq->callback_pending = true;
-		estate->es_async_callback_pending++;
-	}
+	areq->state = ASYNC_COMPLETE;
 }
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
 		 */
 		while ((i = bms_first_member(node->as_needrequest)) >= 0)
 		{
-			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
 			node->as_nasyncpending++;
+
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			/* If this request immediately gives a result, take it. */
+			if (node->as_nasyncresult > 0)
+				return node->as_asyncresult[--node->as_nasyncresult];
 		}
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 	}
 
 	for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-			for (;;)
+			while (node->as_nasyncpending > 0)
 			{
-				if (node->as_nasyncpending == 0)
-				{
-					/*
-					 * If there is no asynchronous activity still pending
-					 * and the synchronous activity is also complete, we're
-					 * totally done scanning this node.  Otherwise, we're
-					 * done with the asynchronous stuff but must continue
-					 * scanning the synchronous children.
-					 */
-					if (node->as_syncdone)
-						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
-					break;
-				}
-				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
-				{
-					/* Timeout reached. */
-					break;
-				}
-				if (node->as_nasyncresult > 0)
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
 				{
 					/* Asynchronous subplan returned a tuple! */
 					--node->as_nasyncresult;
 					return node->as_asyncresult[node->as_nasyncresult];
 				}
+
+				/* Timeout reached.  Move on to the sync nodes, if any. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 			}
 		}
 
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	/* We shouldn't be called until the request is complete. */
-	Assert(areq->request_complete);
+	Assert(areq->state == ASYNC_COMPLETE);
 
 	/* Our result slot shouldn't already be occupied. */
 	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
  *		In async mode, configure for a wait
  * ----------------------------------------------------------------
  */
-void
+bool
 ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit)
 {
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
 	FdwRoutine *fdwroutine = node->fdwroutine;
 
 	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
-	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 23b4e18..72d8cd6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(appendplans);
 	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index dc5b938..1ebdc48 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(appendplans);
 	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69453b5..8443a62 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1520,6 +1520,7 @@ _readAppend(void)
 
 	READ_NODE_FIELD(appendplans);
 	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e7e55c0..c73bbb3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+						   int referent, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	List	   *syncplans = NIL;
 	ListCell   *subpaths;
 	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * explaining tlist entries (see set_deparse_planstate), so we track
+	 * where the first child of best_path->subpaths ends up in the reordered
+	 * subplan list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		{
 			asyncplans = lappend(asyncplans, subplan);
 			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
 		}
 		else
 			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4944,7 +4959,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4955,6 +4970,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
 	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 8a81d7a..de0e96c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4056,7 +4056,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int			idx = ((Append *) ps->plan)->referent;
+
+		dpns->outer_planstate = ((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 
 extern void ExecAsyncForeignScanRequest(EState *estate,
 	PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit);
 extern void ExecAsyncForeignScanNotify(EState *estate,
 	PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
 typedef void (*ForeignAsyncRequest_function) (EState *estate,
 											PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											PendingAsyncRequest *areq,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b50b41c..0c6af86 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,13 @@ typedef struct ResultRelInfo
  * State for an asynchronous tuple request.
  * ----------------
  */
+typedef enum AsyncRequestState
+{
+	ASYNC_IDLE,
+	ASYNC_WAITING,
+	ASYNC_CALLBACK_PENDING,
+	ASYNC_COMPLETE
+} AsyncRequestState;
 typedef struct PendingAsyncRequest
 {
 	int			myindex;			/* Index in es_pending_async. */
@@ -360,8 +367,7 @@ typedef struct PendingAsyncRequest
 	int			request_index;	/* Scratch space for requestor. */
 	int			num_fd_events;	/* Max number of FD events requestee needs. */
 	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
-	bool		callback_pending;			/* Callback is needed. */
-	bool		request_complete;		/* Request complete, result valid. */
+	AsyncRequestState state;
 	Node	   *result;			/* Result (NULL if no more tuples). */
 } PendingAsyncRequest;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
 	Plan		plan;
 	List	   *appendplans;
 	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
-- 
2.9.2

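To summarize the state machine this patch introduces, as I read it from the asserts: a request is ASYNC_IDLE when allocated, becomes ASYNC_WAITING once ExecAsyncSetRequiredEvents registers its wait events, moves to ASYNC_CALLBACK_PENDING when the event loop sees one of those events fire, and ends at ASYNC_COMPLETE once a result is stored. It can also jump straight from ASYNC_IDLE to ASYNC_COMPLETE when the initial request produces a result immediately. A standalone sketch of those transitions, with a hypothetical Req type and driver functions standing in for the executor machinery:

#include <assert.h>
#include <stdio.h>

typedef enum AsyncReqState
{
	ASYNC_IDLE,
	ASYNC_WAITING,
	ASYNC_CALLBACK_PENDING,
	ASYNC_COMPLETE
} AsyncReqState;

typedef struct Req
{
	AsyncReqState state;
	const char *result;
} Req;

/* Like ExecAsyncSetRequiredEvents: registering events starts the wait. */
static void
set_required_events(Req *r)
{
	r->state = ASYNC_WAITING;
}

/* Like ExecAsyncEventWait's socket-readable handling. */
static void
event_fired(Req *r)
{
	assert(r->state == ASYNC_WAITING);
	r->state = ASYNC_CALLBACK_PENDING;
}

/*
 * Like ExecAsyncRequestDone: legal from ASYNC_IDLE (immediate result)
 * or ASYNC_CALLBACK_PENDING (result produced by the notify callback).
 */
static void
request_done(Req *r, const char *result)
{
	assert(r->state == ASYNC_IDLE || r->state == ASYNC_CALLBACK_PENDING);
	r->result = result;
	r->state = ASYNC_COMPLETE;
}

int
main(void)
{
	Req		r = {ASYNC_IDLE, NULL};

	set_required_events(&r);
	event_fired(&r);
	request_done(&r, "tuple");
	printf("complete: %d\n", r.state == ASYNC_COMPLETE);
	return 0;
}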
0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 1337546fee26e5a80372c090acebc8bc53de3508 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/4] Make postgres_fdw async-capable

---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  64 ++--
 contrib/postgres_fdw/postgres_fdw.c            | 483 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 src/backend/executor/execProcnode.c            |   9 +
 src/include/foreign/fdwapi.h                   |   2 +
 7 files changed, 510 insertions(+), 133 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user, allocating
+ * initsize bytes of zeroed memory first if none exists yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d429790..a53fff4 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5082,12 +5082,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | aaaa
  a        | aaaaa
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -5110,12 +5110,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -5138,12 +5138,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | new
- b        | new
- b        | new
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | new
+ b        | new
+ b        | new
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -5166,12 +5166,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | newtoo
- b        | newtoo
- b        | newtoo
  a        | newtoo
  a        | newtoo
  a        | newtoo
+ b        | newtoo
+ b        | newtoo
+ b        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -5259,9 +5259,9 @@ select * from bar where f1 in (select f1 from foo) for update;
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -5296,9 +5296,9 @@ select * from bar where f1 in (select f1 from foo) for share;
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -5561,27 +5561,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
-  2 | 322
   1 | 311
-  6 | 266
+  2 | 322
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 997bd6c..c2b5b17 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -34,6 +34,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -52,6 +53,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -121,10 +125,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -135,7 +156,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -151,6 +172,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+									 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -164,11 +192,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -191,6 +219,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -289,6 +318,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -349,8 +379,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
 static bool postgresForeignAsyncConfigureWait(EState *estate,
-								  PendingAsyncRequest *areq,
-								  bool reinit);
+						    PendingAsyncRequest *areq,
+						    bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
 						   PendingAsyncRequest *areq);
 
@@ -373,7 +403,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -434,6 +467,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1314,12 +1348,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1375,32 +1418,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting for this node on the same connection, let the
+			 * first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * for simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Someone else is holding this connection. Add myself to the tail
+			 * of the waiters' list, then return not-ready.  To avoid scanning
+			 * through the waiters' list, the current owner maintains the
+			 * shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1416,7 +1553,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1424,6 +1561,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1452,9 +1592,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1472,7 +1612,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1480,16 +1620,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1691,7 +1847,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1770,6 +1928,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1780,14 +1940,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1795,10 +1955,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1836,6 +1996,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1856,14 +2018,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1871,10 +2033,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1912,6 +2074,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1932,14 +2096,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1947,10 +2111,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1997,16 +2161,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2286,7 +2450,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2339,7 +2505,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2386,8 +2555,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2505,6 +2674,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2547,6 +2717,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2826,11 +3006,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2896,47 +3076,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Receive the result of a FETCH previously requested on the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -2946,27 +3175,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * let the current connection owner read the result for the running query
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connspec->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3050,7 +3334,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3060,12 +3344,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3073,9 +3357,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3206,9 +3490,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3216,10 +3500,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4365,8 +4649,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
 }
 
 /*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
  */
 static void
 postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4375,22 +4661,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
 	slot = ExecForeignScan(node);
-	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
 }
 
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
 static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
-								  bool reinit)
+						   bool reinit)
 {
-	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
 	return false;
 }
 
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
 static void
 postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
 {
-	elog(ERROR, "postgresForeignAsyncNotify");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
 /*
@@ -4438,7 +4761,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 67126bc..9eff0ba 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -100,6 +101,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4f68e89..de1d96e 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1248,8 +1248,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+		{
+			ForeignScanState *fsstate = (ForeignScanState *)node;
+			FdwRoutine *fdwroutine = fsstate->fdwroutine;
+			if (fdwroutine->ShutdownForeignScan)
+				fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+		}
+		break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
 											PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
 	ForeignAsyncRequest_function ForeignAsyncRequest;
 	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
 	ForeignAsyncNotify_function ForeignAsyncNotify;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
2.9.2

gentblr.sql (text/plain)
testrun.sh (text/plain)
calc.pl (text/plain)
#10Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#9)
6 attachment(s)

This is the previous patch set (0001-0004) rebased onto the current
master, with resowner stuff (0005) and unlikely() (0006) added.

At Tue, 18 Oct 2016 10:30:51 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161018.103051.30820907.horiguchi.kyotaro@lab.ntt.co.jp>

- Errors in the executor can leak the WaitEventSet. Probably we need
to modify ResourceOwners to be able to own WaitEventSets.

WaitEventSet itself is not leaked but epoll-fd should be closed
at failure. This seems doable with TRY-CATCHing in
ExecAsyncEventLoop. (not yet)

Haha, that was careless of me. The wait event set can outlive the
query when a timeout expires, and an error can be thrown at any
point after that. I added a resource-owner entry type for wait event
sets and now hang the ones created in ExecAsyncEventWait on
TopTransactionResourceOwner. WaitLatchOrSocket deliberately doesn't
do this, so its current behavior is unchanged. WaitEventSet has no
identifier usable by resowner.c, so for now I use its address
(pointer value) for that purpose. Patch 0005 does that.
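
Just for illustration, a minimal sketch of the shape of that
registration follows. The three ResourceOwner*WaitEventSet helpers
are hypothetical names following resowner.c's existing
Remember/Forget pattern, not necessarily the names used in 0005:

#include "postgres.h"

#include "storage/latch.h"
#include "utils/resowner.h"

/*
 * Create a WaitEventSet and hang it on TopTransactionResourceOwner so
 * that its epoll fd gets closed even when an error is thrown mid-query.
 * The set's address serves as its identifier for resowner.c, since a
 * WaitEventSet has no other usable one.  The ResourceOwner helpers are
 * assumed names, as noted above.
 */
static WaitEventSet *
create_tracked_wait_event_set(MemoryContext cxt, int nevents)
{
	WaitEventSet *set;

	/* reserve the bookkeeping slot first, so Remember below cannot fail */
	ResourceOwnerEnlargeWaitEventSets(TopTransactionResourceOwner);

	set = CreateWaitEventSet(cxt, nevents);
	ResourceOwnerRememberWaitEventSet(TopTransactionResourceOwner, set);

	return set;
}

static void
destroy_tracked_wait_event_set(WaitEventSet *set)
{
	FreeWaitEventSet(set);
	ResourceOwnerForgetWaitEventSet(TopTransactionResourceOwner, set);
}

On abort, resowner.c would then walk the remembered sets and free any
that are still registered, which is what closes the leaked epoll fd.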

I measured performance and got the following results.

t0 - SELECT sum(a) FROM <local single table>;
pl - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on a single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicated connections>;

Each result is written as "time <ms> (std dev <ms>)".

sync
t0: 3820.33 ( 1.88)
pl: 1608.59 ( 12.06)
pf0: 7928.29 ( 46.58)
pf1: 8023.16 ( 26.43)

async
t0: 3806.31 ( 4.49) 0.4% faster (should be within error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster

t0 is not affected since the ExecProcNode changes are gone.

pl gets a bit slower (almost the same as the simple seqscan case with
the previous patch). This looks like a branch-misprediction penalty.

Using the likely() macro in ExecAppend seems to have shaken off the
degradation; a minimal sketch of the idea follows the numbers below.

sync
t0: 3919.49 ( 5.95)
pl: 1637.95 ( 0.75)
pf0: 8304.20 ( 43.94)
pf1: 8222.09 ( 28.20)

async
t0: 3885.84 ( 40.20) 0.86% faster (should be within error, but stable on my env)
pl: 1617.20 ( 3.51) 1.26% faster (ditto)
pf0: 6680.95 (478.72) 19.5% faster
pf1: 1886.87 ( 36.25) 77.1% faster
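
For reference, here is a self-contained sketch of the unlikely()
idea. The as_nasyncplans field matches the patch, but the harness
around it is made up for illustration:

#include <stdio.h>

/* Branch-prediction hints, as PostgreSQL's c.h defines them for GCC. */
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)

/* Stand-in for the executor node; only the field used here is real. */
typedef struct AppendState
{
	int			as_nasyncplans;
} AppendState;

/*
 * Sketch of the hot path in ExecAppend: plans without async children
 * are by far the common case, so hint the compiler that the async
 * branch is rarely taken and keep the synchronous path fall-through.
 */
static int
exec_append_step(AppendState *node)
{
	if (unlikely(node->as_nasyncplans > 0))
		return 1;			/* would enter ExecAsyncEventLoop() here */

	return 0;				/* plain synchronous child scan */
}

int
main(void)
{
	AppendState node = {0};

	printf("async branch taken: %d\n", exec_append_step(&node));
	return 0;
}

With gcc -O2 the hint tends to lay out the synchronous path as the
straight-line fall-through, which is consistent with the numbers above.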

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-robert-s-2nd-framework.patch (text/x-patch)
From 25eba7e506228ab087e8b743efb039286a8251c4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/6] robert's 2nd framework

---
 contrib/postgres_fdw/postgres_fdw.c     |  49 ++++
 src/backend/executor/Makefile           |   4 +-
 src/backend/executor/README             |  43 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 462 ++++++++++++++++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 162 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 ++++
 src/backend/nodes/copyfuncs.c           |   1 +
 src/backend/nodes/outfuncs.c            |   1 +
 src/backend/nodes/readfuncs.c           |   1 +
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/include/executor/execAsync.h        |  29 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  15 ++
 src/include/nodes/execnodes.h           |  57 +++-
 src/include/nodes/plannodes.h           |   1 +
 17 files changed, 909 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 906d6e6..c480945 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+								  PendingAsyncRequest *areq,
+								  bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -4442,6 +4457,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = postgresIterateForeignScan(node);
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+								  bool reinit)
+{
+	elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	elog(ERROR, "postgresForeignAsyncNotify");
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest.  Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		i = estate->es_num_pending_async;
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (i >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[i] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[i] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[i];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async++;
+
+	/* Initialize the new request. */
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Check for events, but don't block if there are notifications that
+		 * have not been delivered yet.
+		 */
+		if (estate->es_async_callback_pending > 0)
+			ExecAsyncEventWait(estate, 0);
+		else if (!ExecAsyncEventWait(estate, cur_timeout))
+			cur_timeout = 0;			/* Timeout was reached. */
+		else if (timeout > 0)
+		{
+			instr_time      cur_time;
+
+			/* Recompute the timeout remaining for the next iteration. */
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout < 0)
+				cur_timeout = 0;
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			/* Skip it if no callback is pending. */
+			if (!areq->callback_pending)
+				continue;
+
+			/*
+			 * Mark it as no longer needing a callback.  We must do this
+			 * before dispatching the callback in case the callback resets
+			 * the flag.
+			 */
+			areq->callback_pending = false;
+			estate->es_async_callback_pending--;
+
+			/* Perform the actual callback; set requestor_done if appropriate. */
+			if (!areq->request_complete)
+				ExecAsyncNotify(estate, areq);
+			else
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+			}
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all not-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				if (!tail->callback_pending && tail->request_complete)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; on the contrary,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+		estate->es_wait_event_set =
+			CreateWaitEventSet(estate->es_query_cxt,
+							   estate->es_allocated_fd_events + 1);
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0)
+			ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE);
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			if (!areq->callback_pending)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+				estate->es_async_callback_pending++;
+			}
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+			}
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register.  force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->request_complete = true;
+
+	/* Make sure this request is flagged for a callback. */
+	if (!areq->callback_pending)
+	{
+		areq->callback_pending = true;
+		estate->es_async_callback_pending++;
+	}
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, all async subplans need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * If there are any nodes that need a new asynchronous request,
+		 * make all of them.
+		 */
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			node->as_nasyncpending++;
+		}
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			for (;;)
+			{
+				if (node->as_nasyncpending == 0)
+				{
+					/*
+					 * If there is no asynchronous activity still pending
+					 * and the synchronous activity is also complete, we're
+					 * totally done scanning this node.  Otherwise, we're
+					 * done with the asynchronous stuff but must continue
+					 * scanning the synchronous children.
+					 */
+					if (node->as_syncdone)
+						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+					break;
+				}
+				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+				{
+					/* Timeout reached. */
+					break;
+				}
+				if (node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
+		 */
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->request_complete);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* Request is no longer pending. */
+	Assert(node->as_nasyncpending > 0);
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately because it might complete
+	 * synchronously and re-enter this callback; instead, ExecAppend issues
+	 * the new request from its top-level loop.
+	 */
+	node->as_needrequest = bms_add_member(node->as_needrequest,
+										  areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 71714bc..23b4e18 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ae86954..dc5b938 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ad49674..7caa8d3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
 	}
 
 	/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4941,7 +4951,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4951,6 +4961,7 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
 
 	return node;
 }
@@ -6225,3 +6236,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+			break;
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
 } FdwRoutine;
 
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..b50b41c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	bool		callback_pending;			/* Callback is needed. */
+	bool		request_complete;		/* Request complete, result valid. */
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
 	HeapTuple  *es_epqTuple;	/* array of EPQ substitute tuples */
 	bool	   *es_epqTupleSet; /* true if EPQ tuple is provided */
 	bool	   *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_async_callback_pending is the number of PendingAsyncRequests for
+	 * which callback_pending is true.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number of
+	 * events the current wait event set, if any, was allocated to handle.
+	 * es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;
+	int			es_max_pending_async;
+	int			es_async_callback_pending;
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1147,17 +1191,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
 } Append;
 
 /* ----------------
-- 
2.9.2
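
To make the FDW-facing half of this interface concrete, here is a rough
sketch of how a wrapper might fill in the four new FdwRoutine callbacks.
It is illustrative only, not part of the patch: the example_* helpers are
hypothetical stand-ins for a real FDW's connection machinery, and error
handling is omitted.

static bool
exampleIsForeignPathAsyncCapable(ForeignPath *path)
{
	return true;				/* claim async support for every path */
}

static void
exampleForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;

	/* Start the remote query without blocking (hypothetical helper). */
	example_send_query_nonblocking(node);

	/* Tell the core executor this request needs to wait on one socket. */
	ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}

static void
exampleForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
								 bool reinit)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;

	/* Register the connection's socket, with user_data pointing at areq. */
	AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
					  example_connection_socket(node), NULL, areq);
}

static void
exampleForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;
	TupleTableSlot *slot = example_read_tuple(node);	/* NULL at EOF */

	/* Deliver the tuple (or end-of-scan) to the requesting node. */
	ExecAsyncRequestDone(estate, areq, (Node *) slot);
}

A wrapper would then set these four fields in the FdwRoutine returned by
its handler function, alongside the existing callbacks.
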

0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From 4493e6d2d43a5864e9d381cb69270246e0c6234c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/6] Fix some bugs.

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
 contrib/postgres_fdw/postgres_fdw.c            |   3 +-
 src/backend/executor/execAsync.c               |   4 +-
 src/backend/postmaster/pgstat.c                |   3 +
 src/include/pgstat.h                           |   3 +-
 5 files changed, 81 insertions(+), 74 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 88b696c..f9fd172 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6181,12 +6181,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- a        | aaa
- a        | aaaa
- a        | aaaaa
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | aaaa
+ a        | aaaaa
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6209,12 +6209,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6237,12 +6237,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | new
  b        | new
  b        | new
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6265,12 +6265,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | newtoo
- a        | newtoo
- a        | newtoo
  b        | newtoo
  b        | newtoo
  b        | newtoo
+ a        | newtoo
+ a        | newtoo
+ a        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6329,120 +6329,120 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                         QUERY PLAN                                          
----------------------------------------------------------------------------------------------
+                                               QUERY PLAN                                                
+---------------------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar.f1 = foo2.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6470,26 +6470,26 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
-         Hash Cond: (foo.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+         Hash Cond: (foo2.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
-         Merge Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+         Merge Cond: (bar2.f1 = foo2.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6497,19 +6497,19 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo.f1)), foo.f1
-               Sort Key: foo.f1
+               Output: (ROW(foo2.f1)), foo2.f1
+               Sort Key: foo2.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6676,8 +6676,8 @@ update bar set f2 = f2 + 100 returning *;
 update bar set f2 = f2 + 100 returning *;
  f1 | f2  
 ----+-----
-  1 | 311
   2 | 322
+  1 | 311
   6 | 266
   3 | 333
   4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c480945..e75f8a1 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -4474,7 +4475,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
-	slot = postgresIterateForeignScan(node);
+	slot = ExecForeignScan(node);
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
 #include "executor/nodeAppend.h"
 #include "executor/nodeForeignscan.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
-								 occurred_event, EVENT_BUFFER_SIZE);
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
 	if (noccurred == 0)
 		return false;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5112d6d..558bb8f 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3393,6 +3393,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 1c9bf13..40c6d08 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2
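
One note for reading the next patch: it replaces the callback_pending and
request_complete booleans with a single per-request state.  The enum itself
presumably lives in the execnodes.h part of the diff; reconstructed from
the call sites below, it would look roughly like this (illustrative only):

typedef enum AsyncRequestState
{
	ASYNC_IDLE,					/* request issued; no events registered yet */
	ASYNC_WAITING,				/* wait events registered; awaiting an event */
	ASYNC_CALLBACK_PENDING,		/* an event fired; ExecAsyncNotify must run */
	ASYNC_COMPLETE				/* result valid; ready for ExecAsyncResponse */
} AsyncRequestState;

With that in mind, the transitions below read IDLE -> WAITING ->
CALLBACK_PENDING -> COMPLETE, with a shortcut straight to COMPLETE when a
request is satisfied synchronously.
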

0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From 126ed476a6d41e5cfb54be387123ac3a8e9963d0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/6] Modify async execution infrastructure.

---
 contrib/postgres_fdw/expected/postgres_fdw.out |  68 ++++++++--------
 contrib/postgres_fdw/postgres_fdw.c            |   5 +-
 src/backend/executor/execAsync.c               | 105 ++++++++++++++-----------
 src/backend/executor/nodeAppend.c              |  50 ++++++------
 src/backend/executor/nodeForeignscan.c         |   4 +-
 src/backend/nodes/copyfuncs.c                  |   1 +
 src/backend/nodes/outfuncs.c                   |   1 +
 src/backend/nodes/readfuncs.c                  |   1 +
 src/backend/optimizer/plan/createplan.c        |  24 +++++-
 src/backend/utils/adt/ruleutils.c              |   6 +-
 src/include/executor/nodeForeignscan.h         |   2 +-
 src/include/foreign/fdwapi.h                   |   2 +-
 src/include/nodes/execnodes.h                  |  10 ++-
 src/include/nodes/plannodes.h                  |   1 +
 14 files changed, 167 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index f9fd172..4b76e41 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6329,13 +6329,13 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6343,10 +6343,10 @@ select * from bar where f1 in (select f1 from foo) for update;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6366,13 +6366,13 @@ select * from bar where f1 in (select f1 from foo) for update;
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6380,10 +6380,10 @@ select * from bar where f1 in (select f1 from foo) for share;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6404,22 +6404,22 @@ select * from bar where f1 in (select f1 from foo) for share;
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                               QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
+                                         QUERY PLAN                                          
+---------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar.f1 = foo2.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6427,16 +6427,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar2.f1 = foo.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6470,8 +6470,8 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
-         Hash Cond: (foo2.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+         Hash Cond: (foo.f1 = bar.f1)
          ->  Append
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
@@ -6488,8 +6488,8 @@ where bar.f1 = ss.f1;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
-         Merge Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+         Merge Cond: (bar2.f1 = foo.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6497,8 +6497,8 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo2.f1)), foo2.f1
-               Sort Key: foo2.f1
+               Output: (ROW(foo.f1)), foo.f1
+               Sort Key: foo.f1
                ->  Append
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e75f8a1..830212f 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
 								  PendingAsyncRequest *areq,
 								  bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
@@ -4479,11 +4479,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
-static void
+static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 								  bool reinit)
 {
 	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	return false;
 }
 
 static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 	bool reinit);
 static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
 static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 				 PlanState *requestee)
 {
 	PendingAsyncRequest *areq = NULL;
-	int		i = estate->es_num_pending_async;
+	int		nasync = estate->es_num_pending_async;
 
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * We start with 16 slots, and thereafter double the array size each
 	 * time we run out of slots.
 	 */
-	if (i >= estate->es_max_pending_async)
+	if (nasync >= estate->es_max_pending_async)
 	{
 		int	newmax;
 
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
 	 * one.
 	 */
-	if (estate->es_pending_async[i] == NULL)
+	if (estate->es_pending_async[nasync] == NULL)
 	{
 		areq = MemoryContextAllocZero(estate->es_query_cxt,
 									  sizeof(PendingAsyncRequest));
-		estate->es_pending_async[i] = areq;
+		estate->es_pending_async[nasync] = areq;
 	}
 	else
 	{
-		areq = estate->es_pending_async[i];
+		areq = estate->es_pending_async[nasync];
 		MemSet(areq, 0, sizeof(PendingAsyncRequest));
 	}
-	areq->myindex = estate->es_num_pending_async++;
+	areq->myindex = estate->es_num_pending_async;
 
 	/* Initialize the new request. */
 	areq->requestor = requestor;
 	areq->request_index = request_index;
 	areq->requestee = requestee;
 
-	/* Give the requestee a chance to do whatever it wants. */
+	/*
+	 * Give the requestee a chance to do whatever it wants.  If a result
+	 * is immediately available, the request comes back marked
+	 * ASYNC_COMPLETE and we respond to the requestor right away.
+	 */
 	switch (nodeTag(requestee))
 	{
 		case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 			elog(ERROR, "unrecognized node type: %d",
 				(int) nodeTag(requestee));
 	}
+
+	/*
+	 * If a result is available, complete it immediately.
+	 */
+	if (areq->state == ASYNC_COMPLETE)
+	{
+		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+		ExecAsyncResponse(estate, areq);
+
+		return;
+	}
+
+	/* No result available yet, so count this request as pending */
+	estate->es_num_pending_async++;
 }
 
 /*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
-			/* Skip it if no callback is pending. */
-			if (!areq->callback_pending)
-				continue;
-
-			/*
-			 * Mark it as no longer needing a callback.  We must do this
-			 * before dispatching the callback in case the callback resets
-			 * the flag.
-			 */
-			areq->callback_pending = false;
-			estate->es_async_callback_pending--;
-
-			/* Perform the actual callback; set request_done if appropraite. */
-			if (!areq->request_complete)
+			/* If a callback is pending, dispatch it. */
+			if (areq->state == ASYNC_CALLBACK_PENDING)
+			{
+				/*
+				 * Mark it as no longer needing a callback.  We must do this
+				 * before dispatching the callback in case the callback resets
+				 * the flag.
+				 */
+				estate->es_async_callback_pending--;
 				ExecAsyncNotify(estate, areq);
-			else
+			}
+
+			if (areq->state == ASYNC_COMPLETE)
 			{
 				any_node_done = true;
 				if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				PendingAsyncRequest *head;
 				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
 
-				if (!tail->callback_pending && tail->request_complete)
+				if (tail->state == ASYNC_COMPLETE)
 					continue;
 				head = estate->es_pending_async[hidx];
 				estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
  * means wait forever, 0 means don't wait at all, and >0 means wait for the
  * indicated number of milliseconds.
  *
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was nothing to wait for.  The latter can occur when a request completes
+ * while its wait events are being configured.
  */
 static bool
 ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
 	int		n;
 	bool	reinit = false;
 	bool	process_latch_set = false;
+	bool	added = false;
 
 	if (estate->es_wait_event_set == NULL)
 	{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		PendingAsyncRequest *areq = estate->es_pending_async[i];
 
 		if (areq->num_fd_events > 0)
-			ExecAsyncConfigureWait(estate, areq, reinit);
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
 	}
 
+	Assert(added);
+
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
 								 occurred_event, EVENT_BUFFER_SIZE,
 								 WAIT_EVENT_ASYNC_WAIT);
+
 	if (noccurred == 0)
 		return false;
 
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		{
 			PendingAsyncRequest *areq = w->user_data;
 
-			if (!areq->callback_pending)
-			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
-				estate->es_async_callback_pending++;
-			}
+			Assert(areq->state == ASYNC_WAITING);
+
+			areq->state = ASYNC_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
 		}
 	}
 
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 			if (areq->wants_process_latch)
 			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
+				Assert(areq->state == ASYNC_WAITING);
+				areq->state = ASYNC_CALLBACK_PENDING;
 			}
 		}
 	}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
  * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
  * and the number of calls should not exceed areq->num_fd_events (as
  * prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the
+ * responsibility of the node driver to register at least one event per
+ * requestor.
  */
-static void
+static bool
 ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 					   bool reinit)
 {
 	switch (nodeTag(areq->requestee))
 	{
 		case T_ForeignScanState:
-			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
 			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
 	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
 	areq->num_fd_events = num_fd_events;
 	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNC_WAITING;
 
 	if (force_reset && estate->es_wait_event_set != NULL)
 	{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
 	 * need a callback to remove registered wait events.  It's not clear
 	 * that we would come out ahead, so use brute force for now.
 	 */
+	Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
 	if (areq->num_fd_events > 0 || areq->wants_process_latch)
 		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
 
 	/* Save result and mark request as complete. */
 	areq->result = result;
-	areq->request_complete = true;
-
-	/* Make sure this request is flagged for a callback. */
-	if (!areq->callback_pending)
-	{
-		areq->callback_pending = true;
-		estate->es_async_callback_pending++;
-	}
+	areq->state = ASYNC_COMPLETE;
 }
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
 		 */
 		while ((i = bms_first_member(node->as_needrequest)) >= 0)
 		{
-			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
 			node->as_nasyncpending++;
+
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			/* If this request immediately gives a result, take it. */
+			if (node->as_nasyncresult > 0)
+				return node->as_asyncresult[--node->as_nasyncresult];
 		}
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 	}
 
 	for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-			for (;;)
+			while (node->as_nasyncpending > 0)
 			{
-				if (node->as_nasyncpending == 0)
-				{
-					/*
-					 * If there is no asynchronous activity still pending
-					 * and the synchronous activity is also complete, we're
-					 * totally done scanning this node.  Otherwise, we're
-					 * done with the asynchronous stuff but must continue
-					 * scanning the synchronous children.
-					 */
-					if (node->as_syncdone)
-						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
-					break;
-				}
-				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
-				{
-					/* Timeout reached. */
-					break;
-				}
-				if (node->as_nasyncresult > 0)
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
 				{
 					/* Asynchronous subplan returned a tuple! */
 					--node->as_nasyncresult;
 					return node->as_asyncresult[node->as_nasyncresult];
 				}
+
+				/* Timeout reached.  Fall through to sync nodes if any exist. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 			}
 		}
 
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	/* We shouldn't be called until the request is complete. */
-	Assert(areq->request_complete);
+	Assert(areq->state == ASYNC_COMPLETE);
 
 	/* Our result slot shouldn't already be occupied. */
 	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
  *		In async mode, configure for a wait
  * ----------------------------------------------------------------
  */
-void
+bool
 ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit)
 {
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
 	FdwRoutine *fdwroutine = node->fdwroutine;
 
 	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
-	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 23b4e18..72d8cd6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(appendplans);
 	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index dc5b938..1ebdc48 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(appendplans);
 	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69453b5..8443a62 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1520,6 +1520,7 @@ _readAppend(void)
 
 	READ_NODE_FIELD(appendplans);
 	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7caa8d3..ff1d663 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+						   int referent, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	List	   *syncplans = NIL;
 	ListCell   *subpaths;
 	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * explaining tlist entries (see set_deparse_planstate).  For that reason
+	 * we keep the first child in best_path->subpaths at the head of the
+	 * subplan list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		{
 			asyncplans = lappend(asyncplans, subplan);
 			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
 		}
 		else
 			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4951,7 +4966,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4962,6 +4977,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
 	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 8a81d7a..de0e96c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4056,7 +4056,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+		dpns->outer_planstate =
+			((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 
 extern void ExecAsyncForeignScanRequest(EState *estate,
 	PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit);
 extern void ExecAsyncForeignScanNotify(EState *estate,
 	PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
 typedef void (*ForeignAsyncRequest_function) (EState *estate,
 											PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											PendingAsyncRequest *areq,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b50b41c..0c6af86 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,13 @@ typedef struct ResultRelInfo
  * State for an asynchronous tuple request.
  * ----------------
  */
+typedef enum AsyncRequestState
+{
+	ASYNC_IDLE,
+	ASYNC_WAITING,
+	ASYNC_CALLBACK_PENDING,
+	ASYNC_COMPLETE
+} AsyncRequestState;
 typedef struct PendingAsyncRequest
 {
 	int			myindex;			/* Index in es_pending_async. */
@@ -360,8 +367,7 @@ typedef struct PendingAsyncRequest
 	int			request_index;	/* Scratch space for requestor. */
 	int			num_fd_events;	/* Max number of FD events requestee needs. */
 	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
-	bool		callback_pending;			/* Callback is needed. */
-	bool		request_complete;		/* Request complete, result valid. */
+	AsyncRequestState state;
 	Node	   *result;			/* Result (NULL if no more tuples). */
 } PendingAsyncRequest;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
 	Plan		plan;
 	List	   *appendplans;
 	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
-- 
2.9.2

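A note before the postgres_fdw patch: the heart of the executor patch above is the replacement of the callback_pending/request_complete flag pair with the single four-state AsyncRequestState. To make the intended lifecycle concrete, here is a minimal standalone sketch of it under my reading of the patch (a toy model, not PostgreSQL code; the helper names only echo ExecAsyncRequestDone and ExecAsyncSetRequiredEvents):

/*
 * Toy model of the request lifecycle in the patch above:
 * IDLE -> WAITING -> CALLBACK_PENDING -> COMPLETE, or
 * IDLE -> COMPLETE when a result is immediately available.
 */
#include <stdbool.h>
#include <stdio.h>

typedef enum
{
	ASYNC_IDLE,
	ASYNC_WAITING,
	ASYNC_CALLBACK_PENDING,
	ASYNC_COMPLETE
} AsyncRequestState;

typedef struct
{
	AsyncRequestState state;
	int			result;		/* stands in for the returned tuple */
} Request;

/* Requestee: finish at once, or ask the event loop to wait on a socket. */
static void
dispatch(Request *req, bool immediate)
{
	if (immediate)
	{
		req->result = 42;
		req->state = ASYNC_COMPLETE;	/* like ExecAsyncRequestDone */
	}
	else
		req->state = ASYNC_WAITING;		/* like ExecAsyncSetRequiredEvents */
}

/* Event loop: an arriving event flips WAITING to CALLBACK_PENDING. */
static void
event_arrived(Request *req)
{
	if (req->state == ASYNC_WAITING)
		req->state = ASYNC_CALLBACK_PENDING;
}

/* Notify callback: the requestee produces its result and completes. */
static void
notify(Request *req)
{
	if (req->state == ASYNC_CALLBACK_PENDING)
	{
		req->result = 42;
		req->state = ASYNC_COMPLETE;
	}
}

int
main(void)
{
	Request		req = {ASYNC_IDLE, 0};

	dispatch(&req, false);		/* no result yet: request goes to WAITING */
	event_arrived(&req);		/* the socket became readable */
	notify(&req);			/* fetch the data and complete */
	printf("state=%d result=%d\n", req.state, req.result);
	return 0;
}

The gain over the two booleans is that each request is in exactly one phase at a time, so the event loop tests one field instead of reasoning about flag combinations.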
0004-Make-postgres_fdw-async-capable.patchtext/x-patch; charset=us-asciiDownload
From 62d27e1420de596dbd6a3ecdae1dc1d0a51116cf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/6] Make postgres_fdw async-capable

---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  64 ++--
 contrib/postgres_fdw/postgres_fdw.c            | 483 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 src/backend/executor/execProcnode.c            |   9 +
 src/include/foreign/fdwapi.h                   |   2 +
 7 files changed, 510 insertions(+), 133 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user, allocating it
+ * with initsize bytes if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4b76e41..ca69074 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6181,12 +6181,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | aaaa
  a        | aaaaa
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6209,12 +6209,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6237,12 +6237,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | new
- b        | new
- b        | new
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | new
+ b        | new
+ b        | new
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6265,12 +6265,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | newtoo
- b        | newtoo
- b        | newtoo
  a        | newtoo
  a        | newtoo
  a        | newtoo
+ b        | newtoo
+ b        | newtoo
+ b        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6358,9 +6358,9 @@ select * from bar where f1 in (select f1 from foo) for update;
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6395,9 +6395,9 @@ select * from bar where f1 in (select f1 from foo) for share;
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6660,27 +6660,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
-  2 | 322
   1 | 311
-  6 | 266
+  2 | 322
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 830212f..9244e51 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;	/* true if the next tuple is ready to return */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+								 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
 static bool postgresForeignAsyncConfigureWait(EState *estate,
-								  PendingAsyncRequest *areq,
-								  bool reinit);
+						    PendingAsyncRequest *areq,
+						    bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
 						   PendingAsyncRequest *areq);
 
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1337,12 +1371,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1398,32 +1441,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting behind this node on the same connection,
+			 * let the first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Someone else is holding this connection.  Add myself to the
+			 * tail of the waiters' list and return not-ready.  To avoid
+			 * scanning through the waiters' list, the current owner
+			 * maintains the shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1439,7 +1576,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1447,6 +1584,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1475,9 +1615,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1495,7 +1635,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1503,16 +1643,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Clean up async state and absorb any remaining result on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1714,7 +1870,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1793,6 +1951,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1803,14 +1963,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1818,10 +1978,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1859,6 +2019,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1879,14 +2041,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1894,10 +2056,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1935,6 +2097,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1955,14 +2119,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1970,10 +2134,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2020,16 +2184,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2309,7 +2473,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2362,7 +2528,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2409,8 +2578,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2529,6 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2572,6 +2742,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2926,11 +3106,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2996,47 +3176,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Read the result of a query sent by request_more_data and store the rows.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while (fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3046,27 +3275,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while (PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connspec->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3150,7 +3434,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3160,12 +3444,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3173,9 +3457,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3306,9 +3590,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3316,10 +3600,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4465,8 +4749,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
 }
 
 /*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request.  Notify the caller if the next tuple is
+ * immediately available.  ExecForeignScan does additional work to finish
+ * the returned tuple, so call it instead of postgresIterateForeignScan to
+ * acquire a tuple in the expected shape.
  */
 static void
 postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4475,22 +4761,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
 	slot = ExecForeignScan(node);
-	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
 }
 
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when the node is the connection owner.  Otherwise
+ * some other node on this connection is the owner and registers the event.
+ */
 static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
-								  bool reinit)
+						   bool reinit)
 {
-	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
 	return false;
 }
 
+/*
+ * Process a notification from the async mechanism.  ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
 static void
 postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
 {
-	elog(ERROR, "postgresForeignAsyncNotify");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
 /*
@@ -4850,7 +5173,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..1800977 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index bb9d41a..d4b5fad 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1552,8 +1552,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+		{
+			ForeignScanState *fsstate = (ForeignScanState *) node;
+			FdwRoutine *fdwroutine = fsstate->fdwroutine;
+			if (fdwroutine->ShutdownForeignScan)
+				fdwroutine->ShutdownForeignScan(fsstate);
+		}
+		break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
 											PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
 	ForeignAsyncRequest_function ForeignAsyncRequest;
 	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
 	ForeignAsyncNotify_function ForeignAsyncNotify;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
2.9.2

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patchtext/x-patch; charset=us-asciiDownload
From 233e2e5125cdea90fa10fc05dd5ff1885f09cff2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/6] Use resource owner to prevent wait event set from leaking

Wait event sets created for async execution can live across several
iterations, so they can leak if an error occurs during those
iterations. This commit uses a resource owner to prevent such leaks.
---
 src/backend/executor/execAsync.c      | 28 ++++++++++++++--
 src/backend/storage/ipc/latch.c       | 19 ++++++++++-
 src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
 src/include/utils/resowner_private.h  |  8 +++++
 4 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/latch.h"
+#include "utils/resowner_private.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
 static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	if (estate->es_wait_event_set == NULL)
 	{
+		ResourceOwner savedOwner;
+
 		/*
 		 * Allow for a few extra events without reinitializing.  It
 		 * doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		 * of external FDs are likely to run afoul of kernel limits anyway.
 		 */
 		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
-		estate->es_wait_event_set =
-			CreateWaitEventSet(estate->es_query_cxt,
-							   estate->es_allocated_fd_events + 1);
+
+		/*
+		 * The wait event set created here should be released in case of
+		 * error.
+		 */
+		savedOwner = CurrentResourceOwner;
+		CurrentResourceOwner = TopTransactionResourceOwner;
+
+		PG_TRY();
+		{
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_allocated_fd_events + 1);
+		}
+		PG_CATCH();
+		{
+			CurrentResourceOwner = savedOwner;
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		CurrentResourceOwner = savedOwner;
 		AddWaitEventToSet(estate->es_wait_event_set,
 						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 		reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 8488f94..b8bcae9 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set;
+	ResourceOwner savedOwner = CurrentResourceOwner;
+
+	/* This function doesn't need resowner for event set */
+	CurrentResourceOwner = NULL;
+	set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	CurrentResourceOwner = savedOwner;
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	char	   *data;
 	Size		sz = 0;
 
+	if (CurrentResourceOwner)
+		ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	set->resowner = CurrentResourceOwner;
+	if (CurrentResourceOwner)
+		ResourceOwnerRememberWES(set->resowner, set);
 	return set;
 }
 
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..272e460 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -702,6 +715,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -728,6 +742,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1270,3 +1285,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index fd32090..6087257e7 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patchtext/x-patch; charset=us-asciiDownload
From 11749cc592ac8369fcc9fbfb362ddd2a6f2f0a90 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/6] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by branch-misprediction penalties
related to async execution. Apply unlikely() to those branches to
avoid that penalty on the existing synchronous route. Asynchronous
execution already involves a lot of additional code, so this doesn't
add significant degradation.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c234f1f..e82547d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

#11Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#10)
1 attachment(s)

Hi, this is the 7th patch to make instrumentation work.

EXPLAIN ANALYZE shows the following result with the previous patch set.

| Aggregate  (cost=820.25..820.26 rows=1 width=8) (actual time=4324.676..4324.676 rows=1 loops=1)
|   ->  Append  (cost=0.00..791.00 rows=11701 width=4) (actual time=0.910..3663.882 rows=4000000 loops=1)
|         ->  Foreign Scan on ft10  (cost=100.00..197.75 rows=2925 width=4) (never executed)
|         ->  Foreign Scan on ft20  (cost=100.00..197.75 rows=2925 width=4) (never executed)
|         ->  Foreign Scan on ft30  (cost=100.00..197.75 rows=2925 width=4) (never executed)
|         ->  Foreign Scan on ft40  (cost=100.00..197.75 rows=2925 width=4) (never executed)
|         ->  Seq Scan on pf0  (cost=0.00..0.00 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=1)

The current instrumentation code assumes that a node, once asked for a
tuple, either returns one or reports end-of-tuples. This async
framework has two points at which underlying nodes are executed,
ExecAsyncRequest and ExecAsyncEventLoop, so I'm not sure this
treatment is appropriate, but it seems to show sane numbers anyway
(a sketch of the bracketing pattern follows the output below).

| Aggregate  (cost=820.25..820.26 rows=1 width=8) (actual time=4571.205..4571.206 rows=1 loops=1)
|   ->  Append  (cost=0.00..791.00 rows=11701 width=4) (actual time=1.362..3893.114 rows=4000000 loops=1)
|         ->  Foreign Scan on ft10  (cost=100.00..197.75 rows=2925 width=4) (actual time=1.056..770.863 rows=1000000 loops=1)
|         ->  Foreign Scan on ft20  (cost=100.00..197.75 rows=2925 width=4) (actual time=0.461..767.840 rows=1000000 loops=1)
|         ->  Foreign Scan on ft30  (cost=100.00..197.75 rows=2925 width=4) (actual time=0.474..782.547 rows=1000000 loops=1)
|         ->  Foreign Scan on ft40  (cost=100.00..197.75 rows=2925 width=4) (actual time=0.156..765.920 rows=1000000 loops=1)
|         ->  Seq Scan on pf0  (cost=0.00..0.00 rows=1 width=4) (never executed)
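
For reference, here is roughly the bracketing the patch applies at each
of those two dispatch points (a simplified sketch of the attached 0007
hunks; dispatch_async_work is a hypothetical stand-in for the actual
request/notify code):

    if (areq->requestee->instrument)
        InstrStartNode(areq->requestee->instrument);

    dispatch_async_work(areq);          /* hypothetical stand-in */

    if (areq->requestee->instrument)
    {
        /* Count a tuple only when the request actually completed. */
        if (areq->state == ASYNC_COMPLETE)
            InstrStopNode(areq->requestee->instrument,
                          TupIsNull((TupleTableSlot *) areq->result) ? 0.0 : 1.0);
        else
            InstrStopNode(areq->requestee->instrument, 0);
    }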

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0007-Add-instrumentation-to-async-execution.patchtext/x-patch; charset=us-asciiDownload
From 35c60a46f49aab72d492c798ff7eb8fc0e672250 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution

Make explain analyze give sane result when async execution has taken
place.
---
 src/backend/executor/execAsync.c  | 19 +++++++++++++++++++
 src/backend/executor/instrument.c |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	PendingAsyncRequest *areq = NULL;
 	int		nasync = estate->es_num_pending_async;
 
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
 	 * available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	if (areq->state == ASYNC_COMPLETE)
 	{
 		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
 		ExecAsyncResponse(estate, areq);
+		if (areq->requestee->instrument)
+			InstrStopNode(requestee->instrument,
+						  TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
 
 		return;
 	}
 
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
 	/* No result available now, make this node pending */
 	estate->es_num_pending_async++;
 }
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
 			/* Skip it if not pending. */
 			if (areq->state == ASYNC_CALLBACK_PENDING)
 			{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				if (requestor == areq->requestor)
 					requestor_done = true;
 				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
 			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
 		}
 
 		/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2614bf4..6a22a15 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
-- 
2.9.2

#12Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#11)
7 attachment(s)

Hello,

I'm not sure this is in a suitable shape for the commitfest, but I
decided to register it to ride on the bus for 10.0.

This is a PoC patch set for the asynchronous execution feature, based
on the executor infrastructure Robert proposed. These patches are
rebased onto current master.

0001-robert-s-2nd-framework.patch

Robert's executor async infrastructure. Async-driver nodes register
their async-capable children, and synchronization and data transfer
are done out of band of the ordinary ExecProcNode channel, so async
execution no longer disturbs async-unaware nodes or slows them down.
(A short usage sketch follows this list.)

0002-Fix-some-bugs.patch

Some fixes needed to make 0001 work. This is kept separate just to
preserve the shape of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch

The original infrastructure doesn't work when multiple foreign tables
are on the same connection. This makes that case work.

0004-Make-postgres_fdw-async-capable.patch

Makes postgres_fdw work asynchronously.

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch

This addresses a problem Robert pointed out about the 0001 patch:
the WaitEventSet used for async execution can leak on errors.

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch

ExecAppend gets a bit slower because of branch-misprediction
penalties. This fixes that by using the unlikely() macro.

0007-Add-instrumentation-to-async-execution.patch

As described above for 0001, the async infrastructure conveys tuples
outside the ExecProcNode channel, so EXPLAIN ANALYZE requires special
treatment to show sane results. This patch attempts that.
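
To make the 0001 interface concrete, the requestor side looks roughly
like this (a sketch distilled from the README and nodeAppend changes in
0001; child_index and child_node are illustrative names, not the exact
Append code):

    /* Ask an async-capable child for a tuple; request_index lets the
     * requestor tell its children apart when the response arrives. */
    ExecAsyncRequest(estate, &node->ps, child_index, child_node);

    /* Run the event loop: a timeout of -1 blocks until this requestor
     * gets a result, 0 only polls.  The tuple itself is delivered
     * through the requestor's ExecAsyncResponse callback. */
    if (ExecAsyncEventLoop(estate, &node->ps, -1))
    {
        /* ExecAsyncResponse stashed the result (NULL means no more tuples) */
    }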

Results of a performance measurement are in this message:

/messages/by-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp

| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be within error, but stable on my env)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-robert-s-2nd-framework.patchtext/x-patch; charset=us-asciiDownload
From 8519a24a85a0d033ae9b6ddcc175f5948bb90b76 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/7] robert's 2nd framework

---
 contrib/postgres_fdw/postgres_fdw.c     |  49 ++++
 src/backend/executor/Makefile           |   4 +-
 src/backend/executor/README             |  43 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 462 ++++++++++++++++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 162 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 ++++
 src/backend/nodes/copyfuncs.c           |   1 +
 src/backend/nodes/outfuncs.c            |   1 +
 src/backend/nodes/readfuncs.c           |   1 +
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/include/executor/execAsync.h        |  29 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  15 ++
 src/include/nodes/execnodes.h           |  57 +++-
 src/include/nodes/plannodes.h           |   1 +
 17 files changed, 909 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 906d6e6..c480945 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+								  PendingAsyncRequest *areq,
+								  bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -4442,6 +4457,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = postgresIterateForeignScan(node);
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+								  bool reinit)
+{
+	elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	elog(ERROR, "postgresForeignAsyncNotify");
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.
+This might be either because the node is waiting on an event external to
+the database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest.  Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designed async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		i = estate->es_num_pending_async;
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (i >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[i] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[i] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[i];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async++;
+
+	/* Initialize the new request. */
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Check for events, but don't block if there are notifications
+		 * that have not been delivered yet.
+		 */
+		if (estate->es_async_callback_pending > 0)
+			ExecAsyncEventWait(estate, 0);
+		else if (!ExecAsyncEventWait(estate, cur_timeout))
+			cur_timeout = 0;			/* Timeout was reached. */
+		else
+		{
+			instr_time      cur_time;
+			/* note: update the outer cur_timeout rather than shadowing it */
+
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout < 0)
+				cur_timeout = 0;
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			/* Skip it if no callback is pending. */
+			if (!areq->callback_pending)
+				continue;
+
+			/*
+			 * Mark it as no longer needing a callback.  We must do this
+			 * before dispatching the callback in case the callback resets
+			 * the flag.
+			 */
+			areq->callback_pending = false;
+			estate->es_async_callback_pending--;
+
+			/* Perform the actual callback; set requestor_done if appropriate. */
+			if (!areq->request_complete)
+				ExecAsyncNotify(estate, areq);
+			else
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+			}
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all non-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				if (!tail->callback_pending && tail->request_complete)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; instead,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+		estate->es_wait_event_set =
+			CreateWaitEventSet(estate->es_query_cxt,
+							   estate->es_allocated_fd_events + 1);
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0)
+			ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE);
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			if (!areq->callback_pending)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+				estate->es_async_callback_pending++;
+			}
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+			}
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register.  force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->request_complete = true;
+
+	/* Make sure this request is flagged for a callback. */
+	if (!areq->callback_pending)
+	{
+		areq->callback_pending = true;
+		estate->es_async_callback_pending++;
+	}
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, all async requests need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * If there are any nodes that need a new asynchronous request,
+		 * make all of them.
+		 */
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			node->as_nasyncpending++;
+		}
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			for (;;)
+			{
+				if (node->as_nasyncpending == 0)
+				{
+					/*
+					 * If there is no asynchronous activity still pending
+					 * and the synchronous activity is also complete, we're
+					 * totally done scanning this node.  Otherwise, we're
+					 * done with the asynchronous stuff but must continue
+					 * scanning the synchronous children.
+					 */
+					if (node->as_syncdone)
+						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+					break;
+				}
+				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+				{
+					/* Timeout reached. */
+					break;
+				}
+				if (node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
+		 */
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->request_complete);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* Request is no longer pending. */
+	Assert(node->as_nasyncpending > 0);
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately because it might complete
+	 */
+	node->as_needrequest = bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 71714bc..23b4e18 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index ae86954..dc5b938 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ad49674..7caa8d3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
 	}
 
 	/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4941,7 +4951,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4951,6 +4961,7 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
 
 	return node;
 }
@@ -6225,3 +6236,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
 } FdwRoutine;
 
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..b50b41c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	bool		callback_pending;			/* Callback is needed. */
+	bool		request_complete;		/* Request complete, result valid. */
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
 	HeapTuple  *es_epqTuple;	/* array of EPQ substitute tuples */
 	bool	   *es_epqTupleSet; /* true if EPQ tuple is provided */
 	bool	   *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_async_callback_pending is the number of PendingAsyncRequests for
+	 * which callback_pending is true.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number that the
+	 * current wait event set, if any, was sized to handle.  es_wait_event_set,
+	 * if non-NULL, is a previously allocated event set that may be reused by
+	 * a future wait, provided that nothing has been removed and not too many
+	 * more events have been added.
+	 */
+	int			es_num_pending_async;
+	int			es_max_pending_async;
+	int			es_async_callback_pending;
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1147,17 +1191,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
 } Append;
 
 /* ----------------
-- 
2.9.2
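
To make the new FdwRoutine hooks above concrete: postgres_fdw only stubs
them out so far, so here is a minimal sketch of what a real implementation
of the four callbacks might look like.  Everything prefixed my_ is
hypothetical; only the hook signatures, the ExecAsync* helpers, and the use
of areq as the wait-event user_data come from the patch.  (A follow-up
patch below changes ForeignAsyncConfigureWait to return bool.)

	typedef struct my_fdw_state my_fdw_state;	/* hypothetical per-scan state */

	/* hypothetical helpers, standing in for real connection management */
	static bool my_tuple_buffered(my_fdw_state *fstate);
	static void my_consume_input(my_fdw_state *fstate);
	static pgsocket my_socket(my_fdw_state *fstate);
	static TupleTableSlot *my_next_slot(my_fdw_state *fstate);

	static bool
	myIsForeignPathAsyncCapable(ForeignPath *path)
	{
		return true;			/* claim every path, for illustration */
	}

	static void
	myForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
	{
		ForeignScanState *node = (ForeignScanState *) areq->requestee;
		my_fdw_state *fstate = (my_fdw_state *) node->fdw_state;

		if (my_tuple_buffered(fstate))
		{
			/* A result is already available: complete the request at once. */
			ExecAsyncRequestDone(estate, areq, (Node *) my_next_slot(fstate));
			return;
		}

		/* Otherwise tell the core we need one FD event and no process latch. */
		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
	}

	static void
	myForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
								bool reinit)
	{
		ForeignScanState *node = (ForeignScanState *) areq->requestee;
		my_fdw_state *fstate = (my_fdw_state *) node->fdw_state;

		/* Register our socket in the executor-owned wait event set. */
		AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
						  my_socket(fstate), NULL, areq);
	}

	static void
	myForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
	{
		ForeignScanState *node = (ForeignScanState *) areq->requestee;
		my_fdw_state *fstate = (my_fdw_state *) node->fdw_state;

		/* The socket became readable; consume input, complete if possible. */
		my_consume_input(fstate);
		if (my_tuple_buffered(fstate))
			ExecAsyncRequestDone(estate, areq, (Node *) my_next_slot(fstate));
	}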

0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From c0d26333dd549343ab0658aace4389b1ea60eedb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/7] Fix some bugs.

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
 contrib/postgres_fdw/postgres_fdw.c            |   3 +-
 src/backend/executor/execAsync.c               |   4 +-
 src/backend/postmaster/pgstat.c                |   3 +
 src/include/pgstat.h                           |   3 +-
 5 files changed, 81 insertions(+), 74 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 2745ad5..1b36579 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- a        | aaa
- a        | aaaa
- a        | aaaaa
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | aaaa
+ a        | aaaaa
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | new
  b        | new
  b        | new
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | newtoo
- a        | newtoo
- a        | newtoo
  b        | newtoo
  b        | newtoo
  b        | newtoo
+ a        | newtoo
+ a        | newtoo
+ a        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6321,120 +6321,120 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                         QUERY PLAN                                          
----------------------------------------------------------------------------------------------
+                                               QUERY PLAN                                                
+---------------------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar.f1 = foo2.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6462,26 +6462,26 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
-         Hash Cond: (foo.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+         Hash Cond: (foo2.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
-         Merge Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+         Merge Cond: (bar2.f1 = foo2.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6489,19 +6489,19 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo.f1)), foo.f1
-               Sort Key: foo.f1
+               Output: (ROW(foo2.f1)), foo2.f1
+               Sort Key: foo2.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6668,8 +6668,8 @@ update bar set f2 = f2 + 100 returning *;
 update bar set f2 = f2 + 100 returning *;
  f1 | f2  
 ----+-----
-  1 | 311
   2 | 322
+  1 | 311
   6 | 266
   3 | 333
   4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c480945..e75f8a1 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -4474,7 +4475,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
-	slot = postgresIterateForeignScan(node);
+	slot = ExecForeignScan(node);
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
 #include "executor/nodeAppend.h"
 #include "executor/nodeForeignscan.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
-								 occurred_event, EVENT_BUFFER_SIZE);
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
 	if (noccurred == 0)
 		return false;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a392197..ca91dd8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3393,6 +3393,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4e8dac6..87ce505 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2
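
A side note on the WaitEventSetWait() change in this fix: the extra final
argument classifies the sleep for the stats machinery, which is why the new
enum member needs both the pgstat.h entry and the name mapping in pgstat.c.
Condensed, the fixed call site is just the usual pattern (values aside, this
restates the execAsync.c hunk above):

	noccurred = WaitEventSetWait(estate->es_wait_event_set,
								 timeout,	/* -1 means block indefinitely */
								 occurred_event,
								 EVENT_BUFFER_SIZE,
								 WAIT_EVENT_ASYNC_WAIT);	/* reported as
															 * "AsyncExecWait" */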

0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From 75eec490d5fa5e7272066ab35bba30c8c00e87cf Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/7] Modify async execution infrastructure.

---
 contrib/postgres_fdw/expected/postgres_fdw.out |  68 ++++++++--------
 contrib/postgres_fdw/postgres_fdw.c            |   5 +-
 src/backend/executor/execAsync.c               | 105 ++++++++++++++-----------
 src/backend/executor/nodeAppend.c              |  50 ++++++------
 src/backend/executor/nodeForeignscan.c         |   4 +-
 src/backend/nodes/copyfuncs.c                  |   1 +
 src/backend/nodes/outfuncs.c                   |   1 +
 src/backend/nodes/readfuncs.c                  |   1 +
 src/backend/optimizer/plan/createplan.c        |  24 +++++-
 src/backend/utils/adt/ruleutils.c              |   6 +-
 src/include/executor/nodeForeignscan.h         |   2 +-
 src/include/foreign/fdwapi.h                   |   2 +-
 src/include/nodes/execnodes.h                  |  10 ++-
 src/include/nodes/plannodes.h                  |   1 +
 14 files changed, 167 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1b36579..a98e138 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6321,13 +6321,13 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6335,10 +6335,10 @@ select * from bar where f1 in (select f1 from foo) for update;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6358,13 +6358,13 @@ select * from bar where f1 in (select f1 from foo) for update;
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6372,10 +6372,10 @@ select * from bar where f1 in (select f1 from foo) for share;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6396,22 +6396,22 @@ select * from bar where f1 in (select f1 from foo) for share;
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                               QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
+                                         QUERY PLAN                                          
+---------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar.f1 = foo2.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6419,16 +6419,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar2.f1 = foo.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6462,8 +6462,8 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
-         Hash Cond: (foo2.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+         Hash Cond: (foo.f1 = bar.f1)
          ->  Append
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
@@ -6480,8 +6480,8 @@ where bar.f1 = ss.f1;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
-         Merge Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+         Merge Cond: (bar2.f1 = foo.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6489,8 +6489,8 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo2.f1)), foo2.f1
-               Sort Key: foo2.f1
+               Output: (ROW(foo.f1)), foo.f1
+               Sort Key: foo.f1
                ->  Append
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index e75f8a1..830212f 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
 								  PendingAsyncRequest *areq,
 								  bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
@@ -4479,11 +4479,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
-static void
+static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 								  bool reinit)
 {
 	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	return false;
 }
 
 static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 	bool reinit);
 static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
 static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 				 PlanState *requestee)
 {
 	PendingAsyncRequest *areq = NULL;
-	int		i = estate->es_num_pending_async;
+	int		nasync = estate->es_num_pending_async;
 
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * We start with 16 slots, and thereafter double the array size each
 	 * time we run out of slots.
 	 */
-	if (i >= estate->es_max_pending_async)
+	if (nasync >= estate->es_max_pending_async)
 	{
 		int	newmax;
 
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
 	 * one.
 	 */
-	if (estate->es_pending_async[i] == NULL)
+	if (estate->es_pending_async[nasync] == NULL)
 	{
 		areq = MemoryContextAllocZero(estate->es_query_cxt,
 									  sizeof(PendingAsyncRequest));
-		estate->es_pending_async[i] = areq;
+		estate->es_pending_async[nasync] = areq;
 	}
 	else
 	{
-		areq = estate->es_pending_async[i];
+		areq = estate->es_pending_async[nasync];
 		MemSet(areq, 0, sizeof(PendingAsyncRequest));
 	}
-	areq->myindex = estate->es_num_pending_async++;
+	areq->myindex = estate->es_num_pending_async;
 
 	/* Initialize the new request. */
 	areq->requestor = requestor;
 	areq->request_index = request_index;
 	areq->requestee = requestee;
 
-	/* Give the requestee a chance to do whatever it wants. */
+	/*
+	 * Give the requestee a chance to do whatever it wants.
+	 * The requestee completes the request immediately if a result is available.
+	 */
 	switch (nodeTag(requestee))
 	{
 		case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 			elog(ERROR, "unrecognized node type: %d",
 				(int) nodeTag(requestee));
 	}
+
+	/*
+	 * If a result is available, complete it immediately.
+	 */
+	if (areq->state == ASYNC_COMPLETE)
+	{
+		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+		ExecAsyncResponse(estate, areq);
+
+		return;
+	}
+
+	/* No result available now, make this node pending */
+	estate->es_num_pending_async++;
 }
 
 /*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
-			/* Skip it if no callback is pending. */
-			if (!areq->callback_pending)
-				continue;
-
-			/*
-			 * Mark it as no longer needing a callback.  We must do this
-			 * before dispatching the callback in case the callback resets
-			 * the flag.
-			 */
-			areq->callback_pending = false;
-			estate->es_async_callback_pending--;
-
-			/* Perform the actual callback; set request_done if appropraite. */
-			if (!areq->request_complete)
+			/* Dispatch the callback, if one is pending. */
+			if (areq->state == ASYNC_CALLBACK_PENDING)
+			{
+				/*
+				 * Mark it as no longer needing a callback.  We must do this
+				 * before dispatching the callback in case the callback resets
+				 * the flag.
+				 */
+				estate->es_async_callback_pending--;
 				ExecAsyncNotify(estate, areq);
-			else
+			}
+
+			if (areq->state == ASYNC_COMPLETE)
 			{
 				any_node_done = true;
 				if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				PendingAsyncRequest *head;
 				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
 
-				if (!tail->callback_pending && tail->request_complete)
+				if (tail->state == ASYNC_COMPLETE)
 					continue;
 				head = estate->es_pending_async[hidx];
 				estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
  * means wait forever, 0 means don't wait at all, and >0 means wait for the
  * indicated number of milliseconds.
  *
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events, and false if we timed out or there
+ * was no event to wait for (as when a request completes during setup).
  */
 static bool
 ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
 	int		n;
 	bool	reinit = false;
 	bool	process_latch_set = false;
+	bool	added = false;
 
 	if (estate->es_wait_event_set == NULL)
 	{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		PendingAsyncRequest *areq = estate->es_pending_async[i];
 
 		if (areq->num_fd_events > 0)
-			ExecAsyncConfigureWait(estate, areq, reinit);
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
 	}
 
+	Assert(added);
+
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
 								 occurred_event, EVENT_BUFFER_SIZE,
 								 WAIT_EVENT_ASYNC_WAIT);
+
 	if (noccurred == 0)
 		return false;
 
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		{
 			PendingAsyncRequest *areq = w->user_data;
 
-			if (!areq->callback_pending)
-			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
-				estate->es_async_callback_pending++;
-			}
+			Assert(areq->state == ASYNC_WAITING);
+
+			areq->state = ASYNC_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
 		}
 	}
 
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 			if (areq->wants_process_latch)
 			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
+				Assert(areq->state == ASYNC_WAITING);
+				areq->state = ASYNC_CALLBACK_PENDING;
 			}
 		}
 	}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
  * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
  * and the number of calls should not exceed areq->num_fd_events (as
  * prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the
+ * node driver's responsibility to register at least one event per
+ * requestor.
  */
-static void
+static bool
 ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 					   bool reinit)
 {
 	switch (nodeTag(areq->requestee))
 	{
 		case T_ForeignScanState:
-			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
 			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
 	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
 	areq->num_fd_events = num_fd_events;
 	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNC_WAITING;
 
 	if (force_reset && estate->es_wait_event_set != NULL)
 	{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
 	 * need a callback to remove registered wait events.  It's not clear
 	 * that we would come out ahead, so use brute force for now.
 	 */
+	Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
 	if (areq->num_fd_events > 0 || areq->wants_process_latch)
 		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
 
 	/* Save result and mark request as complete. */
 	areq->result = result;
-	areq->request_complete = true;
-
-	/* Make sure this request is flagged for a callback. */
-	if (!areq->callback_pending)
-	{
-		areq->callback_pending = true;
-		estate->es_async_callback_pending++;
-	}
+	areq->state = ASYNC_COMPLETE;
 }
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
 		 */
 		while ((i = bms_first_member(node->as_needrequest)) >= 0)
 		{
-			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
 			node->as_nasyncpending++;
+
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			/* If this request immediately gives a result, take it. */
+			if (node->as_nasyncresult > 0)
+				return node->as_asyncresult[--node->as_nasyncresult];
 		}
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 	}
 
 	for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-			for (;;)
+			while (node->as_nasyncpending > 0)
 			{
-				if (node->as_nasyncpending == 0)
-				{
-					/*
-					 * If there is no asynchronous activity still pending
-					 * and the synchronous activity is also complete, we're
-					 * totally done scanning this node.  Otherwise, we're
-					 * done with the asynchronous stuff but must continue
-					 * scanning the synchronous children.
-					 */
-					if (node->as_syncdone)
-						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
-					break;
-				}
-				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
-				{
-					/* Timeout reached. */
-					break;
-				}
-				if (node->as_nasyncresult > 0)
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
 				{
 					/* Asynchronous subplan returned a tuple! */
 					--node->as_nasyncresult;
 					return node->as_asyncresult[node->as_nasyncresult];
 				}
+
+				/* Timeout reached.  Fall through to the sync nodes, if any. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 			}
 		}
 
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	/* We shouldn't be called until the request is complete. */
-	Assert(areq->request_complete);
+	Assert(areq->state == ASYNC_COMPLETE);
 
 	/* Our result slot shouldn't already be occupied. */
 	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
  *		In async mode, configure for a wait
  * ----------------------------------------------------------------
  */
-void
+bool
 ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit)
 {
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
 	FdwRoutine *fdwroutine = node->fdwroutine;
 
 	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
-	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 23b4e18..72d8cd6 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(appendplans);
 	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index dc5b938..1ebdc48 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(appendplans);
 	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69453b5..8443a62 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1520,6 +1520,7 @@ _readAppend(void)
 
 	READ_NODE_FIELD(appendplans);
 	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7caa8d3..ff1d663 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+						   int referent, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	List	   *syncplans = NIL;
 	ListCell   *subpaths;
 	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * deparsing tlist entries (see set_deparse_planstate), so we must keep
+	 * the first child of best_path->subpaths at the head of the subplan
+	 * list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		{
 			asyncplans = lappend(asyncplans, subplan);
 			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
 		}
 		else
 			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4951,7 +4966,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans,	int referent, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4962,6 +4977,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
 	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 8a81d7a..de0e96c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4056,7 +4056,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int			idx = ((Append *) ((AppendState *) ps)->ps.plan)->referent;
+		dpns->outer_planstate =
+			((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 
 extern void ExecAsyncForeignScanRequest(EState *estate,
 	PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit);
 extern void ExecAsyncForeignScanNotify(EState *estate,
 	PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
 typedef void (*ForeignAsyncRequest_function) (EState *estate,
 											PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											PendingAsyncRequest *areq,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b50b41c..0c6af86 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,13 @@ typedef struct ResultRelInfo
  * State for an asynchronous tuple request.
  * ----------------
  */
+typedef enum AsyncRequestState
+{
+	ASYNC_IDLE,
+	ASYNC_WAITING,
+	ASYNC_CALLBACK_PENDING,
+	ASYNC_COMPLETE
+} AsyncRequestState;
 typedef struct PendingAsyncRequest
 {
 	int			myindex;			/* Index in es_pending_async. */
@@ -360,8 +367,7 @@ typedef struct PendingAsyncRequest
 	int			request_index;	/* Scratch space for requestor. */
 	int			num_fd_events;	/* Max number of FD events requestee needs. */
 	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
-	bool		callback_pending;			/* Callback is needed. */
-	bool		request_complete;		/* Request complete, result valid. */
+	AsyncRequestState state;
 	Node	   *result;			/* Result (NULL if no more tuples). */
 } PendingAsyncRequest;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
 	Plan		plan;
 	List	   *appendplans;
 	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
-- 
2.9.2
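
To summarize the protocol after this restructuring: a request now moves
through an explicit state machine, ASYNC_IDLE -> ASYNC_WAITING (via
ExecAsyncSetRequiredEvents) -> ASYNC_CALLBACK_PENDING (when its event
fires) -> ASYNC_COMPLETE (via ExecAsyncRequestDone), with the immediate
completion shortcut jumping straight to ASYNC_COMPLETE inside
ExecAsyncRequest.  From the requestor's side the flow condenses to the
sketch below.  This is not a drop-in function, just the control flow
distilled from the nodeAppend.c changes above; error paths and the
sync-subplan fallback are omitted.

	static TupleTableSlot *
	fetch_one_async_tuple(AppendState *node)
	{
		EState	   *estate = node->ps.state;
		int			i;

		/* Issue a request to every child that currently needs one. */
		while ((i = bms_first_member(node->as_needrequest)) >= 0)
		{
			node->as_nasyncpending++;
			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);

			/* The request may have completed immediately. */
			if (node->as_nasyncresult > 0)
				return node->as_asyncresult[--node->as_nasyncresult];
		}

		/* Block until some child reaches ASYNC_COMPLETE and yields a tuple. */
		while (node->as_nasyncpending > 0)
		{
			if (ExecAsyncEventLoop(estate, &node->ps, -1) &&
				node->as_nasyncresult > 0)
				return node->as_asyncresult[--node->as_nasyncresult];
		}

		/* All asynchronous children are exhausted. */
		return ExecClearTuple(node->ps.ps_ResultTupleSlot);
	}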

0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 04d33f89391ad8aedfa9b13a2dd72f87f19c3ae1 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/7] Make postgres_fdw async-capable

---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  64 ++--
 contrib/postgres_fdw/postgres_fdw.c            | 483 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 src/backend/executor/execProcnode.c            |   9 +
 src/include/foreign/fdwapi.h                   |   2 +
 7 files changed, 510 insertions(+), 133 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user, allocating and
+ * zeroing initsize bytes first if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index a98e138..38dc55b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | aaaa
  a        | aaaaa
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | new
- b        | new
- b        | new
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | new
+ b        | new
+ b        | new
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | newtoo
- b        | newtoo
- b        | newtoo
  a        | newtoo
  a        | newtoo
  a        | newtoo
+ b        | newtoo
+ b        | newtoo
+ b        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6350,9 +6350,9 @@ select * from bar where f1 in (select f1 from foo) for update;
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6387,9 +6387,9 @@ select * from bar where f1 in (select f1 from foo) for share;
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6652,27 +6652,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
-  2 | 322
   1 | 311
-  6 | 266
+  2 | 322
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 830212f..9244e51 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection*/
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;	/* true if a tuple or EOF is ready to return */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of the waiting
+									 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
 static bool postgresForeignAsyncConfigureWait(EState *estate,
-								  PendingAsyncRequest *areq,
-								  bool reinit);
+						    PendingAsyncRequest *areq,
+						    bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
 						   PendingAsyncRequest *areq);
 
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1337,12 +1371,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1398,32 +1441,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+		 * If someone is waiting for this node on the same connection, let the
+			 * first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Someone else is holding this connection. Add myself to the tail
+			 * of the waiters' list then return not-ready.  To avoid scanning
+			 * through the waiters' list, the current owner is to maintain the
+			 * shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1439,7 +1576,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1447,6 +1584,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1475,9 +1615,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1495,7 +1635,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1503,16 +1643,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1714,7 +1870,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1793,6 +1951,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1803,14 +1963,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1818,10 +1978,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1859,6 +2019,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1879,14 +2041,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1894,10 +2056,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1935,6 +2097,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1955,14 +2119,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1970,10 +2134,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2020,16 +2184,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2309,7 +2473,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2362,7 +2528,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2409,8 +2578,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2529,6 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2572,6 +2742,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2926,11 +3106,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2996,47 +3176,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Receive the rows returned by a previously sent FETCH request.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3046,27 +3275,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * let the current connection owner read the result for the running query
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connspec->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3150,7 +3434,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3160,12 +3444,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3173,9 +3457,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3306,9 +3590,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3316,10 +3600,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4465,8 +4749,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
 }
 
 /*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
  */
 static void
 postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4475,22 +4761,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
 	slot = ExecForeignScan(node);
-	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
 }
 
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
 static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
-								  bool reinit)
+						   bool reinit)
 {
-	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
 	return false;
 }
 
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
 static void
 postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
 {
-	elog(ERROR, "postgresForeignAsyncNotify");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
 /*
@@ -4850,7 +5173,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..1800977 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index f48743c..7153661 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1552,8 +1552,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+		{
+			ForeignScanState *fsstate = (ForeignScanState *)node;
+			FdwRoutine *fdwroutine = fsstate->fdwroutine;
+			if (fdwroutine->ShutdownForeignScan)
+				fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+		}
+		break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
 											PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
 	ForeignAsyncRequest_function ForeignAsyncRequest;
 	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
 	ForeignAsyncNotify_function ForeignAsyncNotify;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
2.9.2

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch (text/x-patch)
From 616e4186479fda5f7f5d87f2fd2e6b9d0fa9f603 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/7] Use resource owner to prevent wait event set from leaking

Wait event sets created for async execution can live across several
iterations, so they leak if an error occurs during those iterations.
This commit uses a resource owner to prevent such leaks.
---
 src/backend/executor/execAsync.c      | 28 ++++++++++++++--
 src/backend/storage/ipc/latch.c       | 19 ++++++++++-
 src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
 src/include/utils/resowner_private.h  |  8 +++++
 4 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/latch.h"
+#include "utils/resowner_private.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
 static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	if (estate->es_wait_event_set == NULL)
 	{
+		ResourceOwner savedOwner;
+
 		/*
 		 * Allow for a few extra events without reinitializing.  It
 		 * doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		 * of external FDs are likely to run afoul of kernel limits anyway.
 		 */
 		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
-		estate->es_wait_event_set =
-			CreateWaitEventSet(estate->es_query_cxt,
-							   estate->es_allocated_fd_events + 1);
+
+		/*
+		 * The wait event set created here should be released in case of
+		 * error.
+		 */
+		savedOwner = CurrentResourceOwner;
+		CurrentResourceOwner = TopTransactionResourceOwner;
+
+		PG_TRY();
+		{
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_allocated_fd_events + 1);
+		}
+		PG_CATCH();
+		{
+			CurrentResourceOwner = savedOwner;
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		CurrentResourceOwner = savedOwner;
 		AddWaitEventToSet(estate->es_wait_event_set,
 						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 		reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 8488f94..b8bcae9 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set;
+	ResourceOwner savedOwner = CurrentResourceOwner;
+
+	/* This function doesn't need resowner for event set */
+	CurrentResourceOwner = NULL;
+	set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	CurrentResourceOwner = savedOwner;
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	char	   *data;
 	Size		sz = 0;
 
+	if (CurrentResourceOwner)
+		ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	set->resowner = CurrentResourceOwner;
+	if (CurrentResourceOwner)
+		ResourceOwnerRememberWES(set->resowner, set);
 	return set;
 }
 
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..272e460 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -702,6 +715,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -728,6 +742,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1270,3 +1285,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index fd32090..6087257e7 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch)
From 4129f613956b2e87fb924533b28ea44a7f7e3dc3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/7] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by the misprediction penalty of
branches related to async execution. Apply unlikely() to them to avoid
that penalty on the existing route. Asynchronous execution already
carries a lot of additional code, so this doesn't add significant
degradation.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c234f1f..e82547d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

0007-Add-instrumentation-to-async-execution.patch (text/x-patch)
From 13872b3aed2cf7627af8cd4d009712574c7c9ad5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution

Make EXPLAIN ANALYZE give sane results when async execution has taken
place.
---
 src/backend/executor/execAsync.c  | 19 +++++++++++++++++++
 src/backend/executor/instrument.c |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	PendingAsyncRequest *areq = NULL;
 	int		nasync = estate->es_num_pending_async;
 
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
 	 * available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	if (areq->state == ASYNC_COMPLETE)
 	{
 		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
 		ExecAsyncResponse(estate, areq);
+		if (areq->requestee->instrument)
+			InstrStopNode(requestee->instrument,
+						  TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
 
 		return;
 	}
 
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
 	/* No result available now, make this node pending */
 	estate->es_num_pending_async++;
 }
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
 			/* Skip it if not pending. */
 			if (areq->state == ASYNC_CALLBACK_PENDING)
 			{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				if (requestor == areq->requestor)
 					requestor_done = true;
 				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
 			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
 		}
 
 		/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2614bf4..6a22a15 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
-- 
2.9.2

#13Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#12)
7 attachment(s)

Hello, this is a maintenance post of rebased patches.
I added a change to ResourceOwnerData that was missing in 0005.

At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>

This is a PoC patch of the asynchronous execution feature, based on
the executor infrastructure Robert proposed. These patches are
rebased onto the current master.

0001-robert-s-2nd-framework.patch

Robert's executor async infrastructure. Async-driver nodes register
their async-capable children, and synchronization and data transfer
are done out of band of the ordinary ExecProcNode channel. So async
execution no longer disturbs async-unaware nodes or slows them down.
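
For orientation, the control flow this infrastructure implements looks
roughly like the following (a simplified sketch of the nodeAppend.c /
execAsync.c paths, not the exact code):

    /* requestor side (e.g. Append), simplified */
    for (i = 0; i < node->as_nasyncplans; ++i)
        ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);

    /* A request may complete immediately; otherwise it stays pending
     * and the event loop drives it to completion via the FDW's
     * ForeignAsyncConfigureWait/ForeignAsyncNotify callbacks. */
    while (node->as_nasyncpending > 0)
        ExecAsyncEventLoop(estate, &node->ps, timeout);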

0002-Fix-some-bugs.patch

Some fixes needed to make 0001 work. This is kept separate just to
preserve the shape of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch

The original infrastructure doesn't work when multiple foreign
tables are on the same connection. This makes that case work.
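
The approach, visible in the 0004 patch, is to make one scan node the
current owner of a connection while other scans queue behind it. A
minimal sketch of the decision postgresIterateForeignScan makes
(append_to_waiter_list is a hypothetical helper standing in for the
inline fsstate->waiter/last_waiter list manipulation):

    PgFdwConnspecate *connspec = fsstate->s.connspec;

    if (connspec->current_owner == node)
        fetch_received_data(node);    /* my query is in flight; consume it */
    else if (connspec->current_owner != NULL)
        append_to_waiter_list(node);  /* hypothetical helper; someone else
                                       * owns the connection, so queue up */
    else
        request_more_data(node);      /* connection is vacant; become the
                                       * owner and send the next FETCH */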

0004-Make-postgres_fdw-async-capable.patch

Makes postgres_fdw work asynchronously.

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch

This addresses a problem Robert pointed out about the 0001 patch:
the WaitEventSet used for async execution can leak on error.
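
The fix follows the usual resource-owner pattern: reserve an array slot
before acquiring the resource, remember the resource afterwards, and
forget it on normal release, so transaction abort can free anything
still remembered. Condensed from what 0005 adds to latch.c:

    /* in CreateWaitEventSet() */
    if (CurrentResourceOwner)
        ResourceOwnerEnlargeWESs(CurrentResourceOwner); /* may fail; do first */
    ...
    set->resowner = CurrentResourceOwner;
    if (set->resowner)
        ResourceOwnerRememberWES(set->resowner, set);

    /* in FreeWaitEventSet() */
    if (set->resowner != NULL)
        ResourceOwnerForgetWES(set->resowner, set);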

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch

ExecAppend gets a bit slower because of branch misprediction
penalties. This fixes that by using the unlikely() macro.
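
The change is just a branch hint so the synchronous path stays on the
well-predicted route; the pattern, as applied in the patch:

    if (unlikely(node->as_nasyncplans > 0))
    {
        /* asynchronous path: dispatch requests, run the event loop */
    }
    /* the plain synchronous Append path continues below */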

0007-Add-instrumentation-to-async-execution.patch

As described above for 0001, the async infrastructure conveys tuples
outside the ExecProcNode channel, so EXPLAIN ANALYZE requires special
treatment to show sane results. This patch attempts that.
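
Concretely, each async callback is bracketed with InstrStartNode and
InstrStopNode, and a tuple is counted only when one actually came back;
condensed from the execAsync.c hunks:

    if (areq->requestee->instrument)
        InstrStartNode(areq->requestee->instrument);
    ...
    if (areq->requestee->instrument)
        InstrStopNode(areq->requestee->instrument,
                      TupIsNull((TupleTableSlot *) areq->result) ? 0.0 : 1.0);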

The result of a performance measurement is in this message:

/messages/by-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp

| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be error but stable on my env..)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-robert-s-2nd-framework.patchtext/x-patch; charset=us-asciiDownload
From f1c33db03494975bdf3ef5a9856a5c99041f0e55 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/7] robert's 2nd framework

---
 contrib/postgres_fdw/postgres_fdw.c     |  49 ++++
 src/backend/executor/Makefile           |   4 +-
 src/backend/executor/README             |  43 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 462 ++++++++++++++++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 162 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 ++++
 src/backend/nodes/copyfuncs.c           |   1 +
 src/backend/nodes/outfuncs.c            |   1 +
 src/backend/nodes/readfuncs.c           |   1 +
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/include/executor/execAsync.h        |  29 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  15 ++
 src/include/nodes/execnodes.h           |  57 +++-
 src/include/nodes/plannodes.h           |   1 +
 17 files changed, 909 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fbe6929..ef4acc7 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+								  PendingAsyncRequest *areq,
+								  bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -4442,6 +4457,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = postgresIterateForeignScan(node);
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+								  bool reinit)
+{
+	elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	elog(ERROR, "postgresForeignAsyncNotify");
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest.  Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		i = estate->es_num_pending_async;
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (i >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[i] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[i] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[i];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async++;
+
+	/* Initialize the new request. */
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Check for events, but don't block if there are notifications that
+		 * have not been delivered yet.
+		 */
+		if (estate->es_async_callback_pending > 0)
+			ExecAsyncEventWait(estate, 0);
+		else if (!ExecAsyncEventWait(estate, cur_timeout))
+			cur_timeout = 0;			/* Timeout was reached. */
+		else
+		{
+			instr_time      cur_time;
+			/* recompute the remaining timeout in the outer cur_timeout */
+
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout < 0)
+				cur_timeout = 0;
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			/* Skip it if no callback is pending. */
+			if (!areq->callback_pending)
+				continue;
+
+			/*
+			 * Mark it as no longer needing a callback.  We must do this
+			 * before dispatching the callback in case the callback resets
+			 * the flag.
+			 */
+			areq->callback_pending = false;
+			estate->es_async_callback_pending--;
+
+			/* Perform the actual callback; set the done flags as appropriate. */
+			if (!areq->request_complete)
+				ExecAsyncNotify(estate, areq);
+			else
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+			}
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all not-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				if (!tail->callback_pending && tail->request_complete)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; instead,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+		estate->es_wait_event_set =
+			CreateWaitEventSet(estate->es_query_cxt,
+							   estate->es_allocated_fd_events + 1);
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0)
+			ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE);
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			if (!areq->callback_pending)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+				estate->es_async_callback_pending++;
+			}
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+			}
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register.  force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->request_complete = true;
+
+	/* Make sure this request is flagged for a callback. */
+	if (!areq->callback_pending)
+	{
+		areq->callback_pending = true;
+		estate->es_async_callback_pending++;
+	}
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, all async subplans need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * If there are any nodes that need a new asynchronous request,
+		 * make all of them.
+		 */
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			node->as_nasyncpending++;
+		}
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			for (;;)
+			{
+				if (node->as_nasyncpending == 0)
+				{
+					/*
+					 * If there is no asynchronous activity still pending
+					 * and the synchronous activity is also complete, we're
+					 * totally done scanning this node.  Otherwise, we're
+					 * done with the asynchronous stuff but must continue
+					 * scanning the synchronous children.
+					 */
+					if (node->as_syncdone)
+						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+					break;
+				}
+				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+				{
+					/* Timeout reached. */
+					break;
+				}
+				if (node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
+		 */
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->request_complete);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* Request is no longer pending. */
+	Assert(node->as_nasyncpending > 0);
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately because it might complete.
+	 */
+	node->as_needrequest = bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 04e49b7..e4a103f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 748b687..1566e0d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 917e6c8..69453b5 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1519,6 +1519,7 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ad49674..7caa8d3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *appendplans, int nasyncplans, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
 	}
 
 	/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4941,7 +4951,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4951,6 +4961,7 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
 
 	return node;
 }
@@ -6225,3 +6236,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
 } FdwRoutine;
 
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..b50b41c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -347,6 +347,25 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	bool		callback_pending;			/* Callback is needed. */
+	bool		request_complete;		/* Request complete, result valid. */
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -422,6 +441,31 @@ typedef struct EState
 	HeapTuple  *es_epqTuple;	/* array of EPQ substitute tuples */
 	bool	   *es_epqTupleSet; /* true if EPQ tuple is provided */
 	bool	   *es_epqScanDone; /* true if EPQ tuple has been fetched */
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_async_callback_pending is the number of PendingAsyncRequests for
+	 * which callback_pending is true.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes; es_allocated_fd_events is how many the current
+	 * wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;
+	int			es_max_pending_async;
+	int			es_async_callback_pending;
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1147,17 +1191,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
 } Append;
 
 /* ----------------
-- 
2.9.2
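
To make the README protocol in the patch above concrete, here is a
minimal requestee-side sketch, assuming a hypothetical FDW whose
connection socket is conn_fd and whose my* callbacks were registered
as in the earlier sketch; fetch_one_slot() stands in for the FDW's
own receive logic. (postgresForeignAsyncRequest in 0001 is the
degenerate case that just answers synchronously via
ExecAsyncRequestDone.)

static int conn_fd;		/* hypothetical: the remote connection's socket */
static TupleTableSlot *fetch_one_slot(ForeignScanState *node);	/* placeholder */

static bool
myIsForeignPathAsyncCapable(ForeignPath *path)
{
	return true;		/* as the demo postgres_fdw code does */
}

static void
myForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
	/*
	 * Kick off the remote fetch here, then declare what we wait on:
	 * one socket event, no process-latch callback, and the previous
	 * WaitEventSet may be reused.
	 */
	ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}

static void
myForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
							bool reinit)
{
	/*
	 * Register our socket; user_data points back to areq so the event
	 * loop can route the notification to us.
	 */
	AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
					  conn_fd, NULL, areq);
}

static void
myForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
	/*
	 * The socket is readable: consume input and hand the tuple (or an
	 * empty slot for EOF) to the requestor.
	 */
	TupleTableSlot *slot = fetch_one_slot((ForeignScanState *) areq->requestee);

	ExecAsyncRequestDone(estate, areq, (Node *) slot);
}

Note that a requestee may also call ExecAsyncRequestDone directly from
its request callback when a tuple is already available, which is what
the demo postgres_fdw implementation does for every tuple.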

0002-Fix-some-bugs.patchtext/x-patch; charset=us-asciiDownload
From f2aa3c04fc79163bb45e9e122b151b39110d3cd7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/7] Fix some bugs.

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
 contrib/postgres_fdw/postgres_fdw.c            |   3 +-
 src/backend/executor/execAsync.c               |   4 +-
 src/backend/postmaster/pgstat.c                |   3 +
 src/include/pgstat.h                           |   3 +-
 5 files changed, 81 insertions(+), 74 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 785f520..457cfdb 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- a        | aaa
- a        | aaaa
- a        | aaaaa
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | aaaa
+ a        | aaaaa
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | new
  b        | new
  b        | new
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | newtoo
- a        | newtoo
- a        | newtoo
  b        | newtoo
  b        | newtoo
  b        | newtoo
+ a        | newtoo
+ a        | newtoo
+ a        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6321,120 +6321,120 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                         QUERY PLAN                                          
----------------------------------------------------------------------------------------------
+                                               QUERY PLAN                                                
+---------------------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar.f1 = foo2.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6462,26 +6462,26 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
-         Hash Cond: (foo.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+         Hash Cond: (foo2.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
-         Merge Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+         Merge Cond: (bar2.f1 = foo2.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6489,19 +6489,19 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo.f1)), foo.f1
-               Sort Key: foo.f1
+               Output: (ROW(foo2.f1)), foo2.f1
+               Sort Key: foo2.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6668,8 +6668,8 @@ update bar set f2 = f2 + 100 returning *;
 update bar set f2 = f2 + 100 returning *;
  f1 | f2  
 ----+-----
-  1 | 311
   2 | 322
+  1 | 311
   6 | 266
   3 | 333
   4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ef4acc7..c64ae41 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -4474,7 +4475,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
-	slot = postgresIterateForeignScan(node);
+	slot = ExecForeignScan(node);
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
 #include "executor/nodeAppend.h"
 #include "executor/nodeForeignscan.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
-								 occurred_event, EVENT_BUFFER_SIZE);
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
 	if (noccurred == 0)
 		return false;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a392197..ca91dd8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3393,6 +3393,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 4e8dac6..87ce505 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2
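
For anyone skimming the hunks above: the substantive change on the wait
path is that the event-set wait is now tagged, so a backend blocked here
shows up as "AsyncExecWait" in pg_stat_activity. Here is a minimal sketch
of the resulting call pattern, assuming the EVENT_BUFFER_SIZE and
event-set setup already present in execAsync.c (the helper name is made
up for illustration):

static bool
wait_for_async_events(EState *estate, long timeout)
{
	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
	int			noccurred;

	/* Block until a registered FD is ready; -1 waits forever, 0 polls. */
	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
								 occurred_event, EVENT_BUFFER_SIZE,
								 WAIT_EVENT_ASYNC_WAIT);

	/* Zero occurred events means the timeout elapsed first. */
	return noccurred > 0;
}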

0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From a4d81f284e9eac5e60c2dfe7e9f693acae73ab36 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/7] Modify async execution infrastructure.

---
 contrib/postgres_fdw/expected/postgres_fdw.out |  68 ++++++++--------
 contrib/postgres_fdw/postgres_fdw.c            |   5 +-
 src/backend/executor/execAsync.c               | 105 ++++++++++++++-----------
 src/backend/executor/nodeAppend.c              |  50 ++++++------
 src/backend/executor/nodeForeignscan.c         |   4 +-
 src/backend/nodes/copyfuncs.c                  |   1 +
 src/backend/nodes/outfuncs.c                   |   1 +
 src/backend/nodes/readfuncs.c                  |   1 +
 src/backend/optimizer/plan/createplan.c        |  24 +++++-
 src/backend/utils/adt/ruleutils.c              |   6 +-
 src/include/executor/nodeForeignscan.h         |   2 +-
 src/include/foreign/fdwapi.h                   |   2 +-
 src/include/nodes/execnodes.h                  |  10 ++-
 src/include/nodes/plannodes.h                  |   1 +
 14 files changed, 167 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 457cfdb..083d947 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6321,13 +6321,13 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6335,10 +6335,10 @@ select * from bar where f1 in (select f1 from foo) for update;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6358,13 +6358,13 @@ select * from bar where f1 in (select f1 from foo) for update;
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6372,10 +6372,10 @@ select * from bar where f1 in (select f1 from foo) for share;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6396,22 +6396,22 @@ select * from bar where f1 in (select f1 from foo) for share;
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                               QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
+                                         QUERY PLAN                                          
+---------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar.f1 = foo2.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6419,16 +6419,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar2.f1 = foo.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6462,8 +6462,8 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
-         Hash Cond: (foo2.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+         Hash Cond: (foo.f1 = bar.f1)
          ->  Append
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
@@ -6480,8 +6480,8 @@ where bar.f1 = ss.f1;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
-         Merge Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+         Merge Cond: (bar2.f1 = foo.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6489,8 +6489,8 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo2.f1)), foo2.f1
-               Sort Key: foo2.f1
+               Output: (ROW(foo.f1)), foo.f1
+               Sort Key: foo.f1
                ->  Append
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c64ae41..b92b279 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
 								  PendingAsyncRequest *areq,
 								  bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
@@ -4479,11 +4479,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
-static void
+static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 								  bool reinit)
 {
 	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	return false;
 }
 
 static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 	bool reinit);
 static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
 static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 				 PlanState *requestee)
 {
 	PendingAsyncRequest *areq = NULL;
-	int		i = estate->es_num_pending_async;
+	int		nasync = estate->es_num_pending_async;
 
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * We start with 16 slots, and thereafter double the array size each
 	 * time we run out of slots.
 	 */
-	if (i >= estate->es_max_pending_async)
+	if (nasync >= estate->es_max_pending_async)
 	{
 		int	newmax;
 
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
 	 * one.
 	 */
-	if (estate->es_pending_async[i] == NULL)
+	if (estate->es_pending_async[nasync] == NULL)
 	{
 		areq = MemoryContextAllocZero(estate->es_query_cxt,
 									  sizeof(PendingAsyncRequest));
-		estate->es_pending_async[i] = areq;
+		estate->es_pending_async[nasync] = areq;
 	}
 	else
 	{
-		areq = estate->es_pending_async[i];
+		areq = estate->es_pending_async[nasync];
 		MemSet(areq, 0, sizeof(PendingAsyncRequest));
 	}
-	areq->myindex = estate->es_num_pending_async++;
+	areq->myindex = estate->es_num_pending_async;
 
 	/* Initialize the new request. */
 	areq->requestor = requestor;
 	areq->request_index = request_index;
 	areq->requestee = requestee;
 
-	/* Give the requestee a chance to do whatever it wants. */
+	/*
+	 * Give the requestee a chance to do whatever it wants.
+	 * Request functions mark the request complete when a result is available.
+	 */
 	switch (nodeTag(requestee))
 	{
 		case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 			elog(ERROR, "unrecognized node type: %d",
 				(int) nodeTag(requestee));
 	}
+
+	/*
+	 * If a result is available, complete it immediately.
+	 */
+	if (areq->state == ASYNC_COMPLETE)
+	{
+		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+		ExecAsyncResponse(estate, areq);
+
+		return;
+	}
+
+	/* No result available now, make this node pending */
+	estate->es_num_pending_async++;
 }
 
 /*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
-			/* Skip it if no callback is pending. */
-			if (!areq->callback_pending)
-				continue;
-
-			/*
-			 * Mark it as no longer needing a callback.  We must do this
-			 * before dispatching the callback in case the callback resets
-			 * the flag.
-			 */
-			areq->callback_pending = false;
-			estate->es_async_callback_pending--;
-
-			/* Perform the actual callback; set request_done if appropraite. */
-			if (!areq->request_complete)
+			/* Skip it if not pending. */
+			if (areq->state == ASYNC_CALLBACK_PENDING)
+			{
+				/*
+				 * Mark it as no longer needing a callback.  We must do this
+				 * before dispatching the callback in case the callback resets
+				 * the flag.
+				 */
+				estate->es_async_callback_pending--;
 				ExecAsyncNotify(estate, areq);
-			else
+			}
+
+			if (areq->state == ASYNC_COMPLETE)
 			{
 				any_node_done = true;
 				if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				PendingAsyncRequest *head;
 				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
 
-				if (!tail->callback_pending && tail->request_complete)
+				if (tail->state == ASYNC_COMPLETE)
 					continue;
 				head = estate->es_pending_async[hidx];
 				estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
  * means wait forever, 0 means don't wait at all, and >0 means wait for the
  * indicated number of milliseconds.
  *
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was nothing to wait for (e.g. the request was completed at request time).
  */
 static bool
 ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
 	int		n;
 	bool	reinit = false;
 	bool	process_latch_set = false;
+	bool	added = false;
 
 	if (estate->es_wait_event_set == NULL)
 	{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		PendingAsyncRequest *areq = estate->es_pending_async[i];
 
 		if (areq->num_fd_events > 0)
-			ExecAsyncConfigureWait(estate, areq, reinit);
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
 	}
 
+	Assert(added);
+
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
 								 occurred_event, EVENT_BUFFER_SIZE,
 								 WAIT_EVENT_ASYNC_WAIT);
+
 	if (noccurred == 0)
 		return false;
 
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		{
 			PendingAsyncRequest *areq = w->user_data;
 
-			if (!areq->callback_pending)
-			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
-				estate->es_async_callback_pending++;
-			}
+			Assert(areq->state == ASYNC_WAITING);
+
+			areq->state = ASYNC_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
 		}
 	}
 
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 			if (areq->wants_process_latch)
 			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
+				Assert(areq->state == ASYNC_WAITING);
+				areq->state = ASYNC_CALLBACK_PENDING;
 			}
 		}
 	}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
  * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
  * and the number of calls should not exceed areq->num_fd_events (as
  * prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * An individual request may omit registering an event, but it is the
+ * responsibility of the node driver to register at least one event per
+ * requestor.
  */
-static void
+static bool
 ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 					   bool reinit)
 {
 	switch (nodeTag(areq->requestee))
 	{
 		case T_ForeignScanState:
-			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
 			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
 	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
 	areq->num_fd_events = num_fd_events;
 	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNC_WAITING;
 
 	if (force_reset && estate->es_wait_event_set != NULL)
 	{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
 	 * need a callback to remove registered wait events.  It's not clear
 	 * that we would come out ahead, so use brute force for now.
 	 */
+	Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
 	if (areq->num_fd_events > 0 || areq->wants_process_latch)
 		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
 
 	/* Save result and mark request as complete. */
 	areq->result = result;
-	areq->request_complete = true;
-
-	/* Make sure this request is flagged for a callback. */
-	if (!areq->callback_pending)
-	{
-		areq->callback_pending = true;
-		estate->es_async_callback_pending++;
-	}
+	areq->state = ASYNC_COMPLETE;
 }
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
 		 */
 		while ((i = bms_first_member(node->as_needrequest)) >= 0)
 		{
-			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
 			node->as_nasyncpending++;
+
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			/* If this request immediately gives a result, take it. */
+			if (node->as_nasyncresult > 0)
+				return node->as_asyncresult[--node->as_nasyncresult];
 		}
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 	}
 
 	for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-			for (;;)
+			while (node->as_nasyncpending > 0)
 			{
-				if (node->as_nasyncpending == 0)
-				{
-					/*
-					 * If there is no asynchronous activity still pending
-					 * and the synchronous activity is also complete, we're
-					 * totally done scanning this node.  Otherwise, we're
-					 * done with the asynchronous stuff but must continue
-					 * scanning the synchronous children.
-					 */
-					if (node->as_syncdone)
-						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
-					break;
-				}
-				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
-				{
-					/* Timeout reached. */
-					break;
-				}
-				if (node->as_nasyncresult > 0)
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
 				{
 					/* Asynchronous subplan returned a tuple! */
 					--node->as_nasyncresult;
 					return node->as_asyncresult[node->as_nasyncresult];
 				}
+
+				/* Timeout reached.  Fall through to the sync nodes, if any. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 			}
 		}
 
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	/* We shouldn't be called until the request is complete. */
-	Assert(areq->request_complete);
+	Assert(areq->state == ASYNC_COMPLETE);
 
 	/* Our result slot shouldn't already be occupied. */
 	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
  *		In async mode, configure for a wait
  * ----------------------------------------------------------------
  */
-void
+bool
 ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit)
 {
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
 	FdwRoutine *fdwroutine = node->fdwroutine;
 
 	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
-	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e4a103f..27ccf9d 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(appendplans);
 	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1566e0d..c8b9f31 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(appendplans);
 	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 69453b5..8443a62 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1520,6 +1520,7 @@ _readAppend(void)
 
 	READ_NODE_FIELD(appendplans);
 	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7caa8d3..ff1d663 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+						   int referent, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	List	   *syncplans = NIL;
 	ListCell   *subpaths;
 	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * deparsing tlist entries (see set_deparse_planstate), so we must keep
+	 * the first child of best_path->subpaths at the head of the subplan
+	 * list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		{
 			asyncplans = lappend(asyncplans, subplan);
 			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
 		}
 		else
 			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4951,7 +4966,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans,	int referent, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4962,6 +4977,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
 	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index a3a4174..9a2ee83 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4079,7 +4079,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+		dpns->outer_planstate =
+			((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 
 extern void ExecAsyncForeignScanRequest(EState *estate,
 	PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit);
 extern void ExecAsyncForeignScanNotify(EState *estate,
 	PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
 typedef void (*ForeignAsyncRequest_function) (EState *estate,
 											PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											PendingAsyncRequest *areq,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b50b41c..0c6af86 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,13 @@ typedef struct ResultRelInfo
  * State for an asynchronous tuple request.
  * ----------------
  */
+typedef enum AsyncRequestState
+{
+	ASYNC_IDLE,
+	ASYNC_WAITING,
+	ASYNC_CALLBACK_PENDING,
+	ASYNC_COMPLETE
+} AsyncRequestState;
 typedef struct PendingAsyncRequest
 {
 	int			myindex;			/* Index in es_pending_async. */
@@ -360,8 +367,7 @@ typedef struct PendingAsyncRequest
 	int			request_index;	/* Scratch space for requestor. */
 	int			num_fd_events;	/* Max number of FD events requestee needs. */
 	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
-	bool		callback_pending;			/* Callback is needed. */
-	bool		request_complete;		/* Request complete, result valid. */
+	AsyncRequestState state;
 	Node	   *result;			/* Result (NULL if no more tuples). */
 } PendingAsyncRequest;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
 	Plan		plan;
 	List	   *appendplans;
 	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
-- 
2.9.2
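
Before reading 0004, it may help to restate the request lifecycle that
0003 introduces. The enum is the one added to execnodes.h (the per-value
comments are mine); the transition notes summarize the execAsync.c
changes rather than quoting code from the patch:

typedef enum AsyncRequestState
{
	ASYNC_IDLE,				/* slot (re)initialized, nothing outstanding */
	ASYNC_WAITING,			/* events registered via ExecAsyncSetRequiredEvents */
	ASYNC_CALLBACK_PENDING,	/* WaitEventSetWait saw an event; notify needed */
	ASYNC_COMPLETE			/* ExecAsyncRequestDone stored a valid result */
} AsyncRequestState;

/*
 * Transitions, as implemented in execAsync.c:
 *
 *   IDLE -> COMPLETE              result was available at request time
 *   IDLE -> WAITING               the requestee registered wait events
 *   WAITING -> CALLBACK_PENDING   an FD event (or the process latch) fired
 *   CALLBACK_PENDING -> COMPLETE  ExecAsyncNotify finished with a result
 */

A request that is already ASYNC_COMPLETE when ExecAsyncRequest returns is
answered on the spot and never becomes pending, which is what lets
ExecAppend consume immediately-available tuples without entering the
event loop.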

0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From f708cece53a6dd21647478b0fba6a8b4dff992a0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/7] Make postgres_fdw async-capable

---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  64 ++--
 contrib/postgres_fdw/postgres_fdw.c            | 483 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 src/backend/executor/execProcnode.c            |   9 +
 src/include/foreign/fdwapi.h                   |   2 +
 7 files changed, 510 insertions(+), 133 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocate it with
+ * initsize bytes if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 083d947..15519c1 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | aaaa
  a        | aaaaa
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | new
- b        | new
- b        | new
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | new
+ b        | new
+ b        | new
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | newtoo
- b        | newtoo
- b        | newtoo
  a        | newtoo
  a        | newtoo
  a        | newtoo
+ b        | newtoo
+ b        | newtoo
+ b        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6350,9 +6350,9 @@ select * from bar where f1 in (select f1 from foo) for update;
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6387,9 +6387,9 @@ select * from bar where f1 in (select f1 from foo) for share;
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6652,27 +6652,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
-  2 | 322
   1 | 311
-  6 | 266
+  2 | 322
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index b92b279..21e7fd9 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+								 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
 static bool postgresForeignAsyncConfigureWait(EState *estate,
-								  PendingAsyncRequest *areq,
-								  bool reinit);
+						    PendingAsyncRequest *areq,
+						    bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
 						   PendingAsyncRequest *areq);
 
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1337,12 +1371,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1398,32 +1441,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting for this node on the same connection, let
+			 * the first waiter become the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Someone else is holding this connection.  Add myself to the
+			 * tail of the waiters' list and return not-ready.  To avoid
+			 * scanning through the waiters' list, the current owner
+			 * maintains a shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1439,7 +1576,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1447,6 +1584,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1475,9 +1615,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1495,7 +1635,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1503,16 +1643,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony state and clean up leftover results on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1714,7 +1870,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1793,6 +1951,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1803,14 +1963,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1818,10 +1978,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1859,6 +2019,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1879,14 +2041,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1894,10 +2056,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1935,6 +2097,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1955,14 +2119,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1970,10 +2134,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2020,16 +2184,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2309,7 +2473,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2362,7 +2528,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2409,8 +2578,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2529,6 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2572,6 +2742,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2926,11 +3106,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2996,47 +3176,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Retrieve the rows already received for the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3046,27 +3275,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connspec->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3150,7 +3434,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3160,12 +3444,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3173,9 +3457,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3306,9 +3590,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3316,10 +3600,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4465,8 +4749,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
 }
 
 /*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
  */
 static void
 postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4475,22 +4761,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
 	slot = ExecForeignScan(node);
-	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
 }
 
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
 static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
-								  bool reinit)
+						   bool reinit)
 {
-	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
 	return false;
 }
 
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
 static void
 postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
 {
-	elog(ERROR, "postgresForeignAsyncNotify");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
 /*
@@ -4850,7 +5173,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..1800977 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index f48743c..7153661 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1552,8 +1552,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+		{
+			ForeignScanState *fsstate = (ForeignScanState *)node;
+			FdwRoutine *fdwroutine = fsstate->fdwroutine;
+			if (fdwroutine->ShutdownForeignScan)
+				fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+		}
+		break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
 											PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
 	ForeignAsyncRequest_function ForeignAsyncRequest;
 	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
 	ForeignAsyncNotify_function ForeignAsyncNotify;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
2.9.2

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patchtext/x-patch; charset=us-asciiDownload
From 951d3a9ee1bc73c63428620c0c02b225451275c2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/7] Use resource owner to prevent wait event set from leaking

Wait event sets created for async execution can live across several
iterations, so they leak if an error occurs during those
iterations. This commit uses a resource owner to prevent such leaks.
---
 src/backend/executor/execAsync.c      | 28 ++++++++++++++--
 src/backend/storage/ipc/latch.c       | 19 ++++++++++-
 src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
 src/include/utils/resowner_private.h  |  8 +++++
 4 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/latch.h"
+#include "utils/resowner_private.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
 static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	if (estate->es_wait_event_set == NULL)
 	{
+		ResourceOwner savedOwner;
+
 		/*
 		 * Allow for a few extra events without reinitializing.  It
 		 * doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		 * of external FDs are likely to run afoul of kernel limits anyway.
 		 */
 		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
-		estate->es_wait_event_set =
-			CreateWaitEventSet(estate->es_query_cxt,
-							   estate->es_allocated_fd_events + 1);
+
+		/*
+		 * The wait event set created here should be released in case of
+		 * error.
+		 */
+		savedOwner = CurrentResourceOwner;
+		CurrentResourceOwner = TopTransactionResourceOwner;
+
+		PG_TRY();
+		{
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_allocated_fd_events + 1);
+		}
+		PG_CATCH();
+		{
+			CurrentResourceOwner = savedOwner;
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		CurrentResourceOwner = savedOwner;
 		AddWaitEventToSet(estate->es_wait_event_set,
 						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 		reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 8488f94..b8bcae9 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set;
+	ResourceOwner savedOwner = CurrentResourceOwner;
+
+	/* This function doesn't need resowner for event set */
+	CurrentResourceOwner = NULL;
+	set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	CurrentResourceOwner = savedOwner;
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	char	   *data;
 	Size		sz = 0;
 
+	if (CurrentResourceOwner)
+		ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	set->resowner = CurrentResourceOwner;
+	if (CurrentResourceOwner)
+		ResourceOwnerRememberWES(set->resowner, set);
 	return set;
 }
 
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 07075ce..0b590c1 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -702,6 +715,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -728,6 +742,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1270,3 +1285,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index fd32090..6087257e7 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patchtext/x-patch; charset=us-asciiDownload
From 4a49530e1b4d968ae067819bf872049ebfae48eb Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/7] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by branch-misprediction penalties
on the branches related to async execution. Apply unlikely() to them
to avoid that penalty on the synchronous route. Asynchronous
execution already involves a lot of additional code, so this doesn't
add significant degradation there.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c234f1f..e82547d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

0007-Add-instrumentation-to-async-execution.patchtext/x-patch; charset=us-asciiDownload
From 9d6a9444aea28c2880ecbedcaa3d721150d4a988 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution

Make EXPLAIN ANALYZE give sane results when async execution has
taken place.
---
 src/backend/executor/execAsync.c  | 19 +++++++++++++++++++
 src/backend/executor/instrument.c |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	PendingAsyncRequest *areq = NULL;
 	int		nasync = estate->es_num_pending_async;
 
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
 	 * available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	if (areq->state == ASYNC_COMPLETE)
 	{
 		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
 		ExecAsyncResponse(estate, areq);
+		if (areq->requestee->instrument)
+			InstrStopNode(requestee->instrument,
+						  TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
 
 		return;
 	}
 
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
 	/* No result available now, make this node pending */
 	estate->es_num_pending_async++;
 }
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
 			/* Skip it if not pending. */
 			if (areq->state == ASYNC_CALLBACK_PENDING)
 			{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				if (requestor == areq->requestor)
 					requestor_done = true;
 				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
 			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
 		}
 
 		/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2614bf4..6a22a15 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
-- 
2.9.2

#14Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#13)

Hello,

I cannot respond until next Monday, so I am moving this to the next
CF myself.

At Tue, 15 Nov 2016 20:25:13 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161115.202513.268072050.horiguchi.kyotaro@lab.ntt.co.jp>

Hello, this is a maintenance post of rebased patches.
I added a change to ResourceOwnerData that was missing from 0005.

At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>

This is a PoC patch of the asynchronous execution feature, based on
the executor infrastructure Robert proposed. These patches are
rebased on the current master.

0001-robert-s-2nd-framework.patch

Robert's executor async infrastructure. Async-driver nodes
register their async-capable children, and synchronization and
data transfer are done out of band of the ordinary ExecProcNode
channel, so async execution no longer disturbs async-unaware
nodes or slows them down.
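
For illustration, the requestor side of this interface looks roughly
like the following untested sketch (nasyncplans/asyncplans stand in
for the requestor's own bookkeeping; the Exec* calls are the ones
0001 introduces):

    /* Ask each async-capable child for a tuple.  request_index lets
     * our ExecAsyncResponse callback tell which child answered. */
    for (i = 0; i < nasyncplans; i++)
        ExecAsyncRequest(estate, &node->ps, i, asyncplans[i]);

    /*
     * Run the event loop.  A timeout of -1 blocks until some child
     * delivers a result to this requestor, 0 merely polls; the tuple
     * itself arrives through the ExecAsyncResponse callback.
     */
    ExecAsyncEventLoop(estate, &node->ps, -1);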

0002-Fix-some-bugs.patch

Some fixes needed to make 0001 work. This is kept separate just to
preserve the shape of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch

The original infrastructure doesn't work when multiple foreign
tables are on the same connection. This makes it work.
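
The core of the fix is a small per-connection storage shared by all
nodes using the same PGconn. A sketch, showing only the field the
postgres_fdw patch actually consults:

    /*
     * One instance per remote connection, obtained through
     * GetConnectionSpecificStorage().  It records which scan node has
     * a query in flight; before any other node may send a query it
     * must make the owner absorb its pending result, which is what
     * vacate_connection() does.
     */
    typedef struct PgFdwConnspecate
    {
        ForeignScanState *current_owner;  /* NULL when the conn is idle */
    } PgFdwConnspecate;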

0004-Make-postgres_fdw-async-capable.patch

Makes postgres_fdw work asynchronously.

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch

This addresses a problem Robert pointed out about the 0001 patch:
the WaitEventSet used for async execution can leak on errors.
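
Condensed, the bookkeeping this patch adds around wait event sets is:

    /* In CreateWaitEventSet(): reserve the array slot first, ... */
    if (CurrentResourceOwner)
        ResourceOwnerEnlargeWESs(CurrentResourceOwner);
    ...
    /* ... then remember the set once it is fully built. */
    set->resowner = CurrentResourceOwner;
    if (CurrentResourceOwner)
        ResourceOwnerRememberWES(set->resowner, set);

    /* In FreeWaitEventSet(): forget it again.  If an error aborts the
     * query first, the resource owner frees the set and warns. */
    if (set->resowner != NULL)
        ResourceOwnerForgetWES(set->resowner, set);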

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch

ExecAppend gets a bit slower due to branch-misprediction
penalties. This fixes that by using the unlikely() macro.
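
(unlikely() here is essentially the usual hint from c.h, expanding
to a plain test where __builtin_expect is unavailable:

    #if __GNUC__ >= 3
    #define unlikely(x)	__builtin_expect((x) != 0, 0)
    #else
    #define unlikely(x)	((x) != 0)
    #endif

so the synchronous route stays on the predicted side of the branch.)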

0007-Add-instrumentation-to-async-execution.patch

As described above for 0001, the async infrastructure conveys
tuples outside the ExecProcNode channel, so EXPLAIN ANALYZE needs
special treatment to show sane results. This patch attempts that.
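
The special treatment boils down to bracketing each async dispatch
and callback with the usual instrument calls, counting a tuple only
when one actually arrived (condensed from the diff):

    if (areq->requestee->instrument)
        InstrStartNode(areq->requestee->instrument);
    /* ... dispatch the request, or deliver the pending callback ... */
    if (areq->requestee->instrument)
        InstrStopNode(areq->requestee->instrument,
                      TupIsNull((TupleTableSlot *) areq->result) ? 0.0 : 1.0);

plus a tweak so that InstrStopNode() latches firsttuple only when
nTuples > 0.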

Results of a performance measurement are in this message.

/messages/by-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp

| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be within error, but stable on my env)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#14)
7 attachment(s)

This patch set conflicts with e13029a (es_query_dsa), so I have
rebased it.

At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>

This is a PoC patch of the asynchronous execution feature, based on
the executor infrastructure Robert proposed. These patches are
rebased on the current master.

0001-robert-s-2nd-framework.patch

Robert's executor async infrastructure. Async-driver nodes
register their async-capable children, and synchronization and
data transfer are done out of band of the ordinary ExecProcNode
channel, so async execution no longer disturbs async-unaware
nodes or slows them down.

0002-Fix-some-bugs.patch

Some fixes needed to make 0001 work. This is kept separate just to
preserve the shape of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch

The original infrastructure doesn't work when multiple foreign
tables are on the same connection. This makes it work.

0004-Make-postgres_fdw-async-capable.patch

Makes postgres_fdw work asynchronously.

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch

This addresses a problem Robert pointed out about the 0001 patch:
the WaitEventSet used for async execution can leak on errors.

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch

ExecAppend gets a bit slower due to branch-misprediction
penalties. This fixes that by using the unlikely() macro.

0007-Add-instrumentation-to-async-execution.patch

As described above for 0001, the async infrastructure conveys
tuples outside the ExecProcNode channel, so EXPLAIN ANALYZE needs
special treatment to show sane results. This patch attempts that.

Results of a performance measurement are in this message.

/messages/by-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp

| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be within error, but stable on my env)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-robert-s-2nd-framework.patchtext/x-patch; charset=us-asciiDownload
From 68e8bbb5996f8a3605b440933d59bbd12268269a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/7] robert's 2nd framework

---
 contrib/postgres_fdw/postgres_fdw.c     |  49 ++++
 src/backend/executor/Makefile           |   4 +-
 src/backend/executor/README             |  43 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 462 ++++++++++++++++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 162 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 ++++
 src/backend/nodes/copyfuncs.c           |   1 +
 src/backend/nodes/outfuncs.c            |   1 +
 src/backend/nodes/readfuncs.c           |   1 +
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/include/executor/execAsync.h        |  29 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  15 ++
 src/include/nodes/execnodes.h           |  57 +++-
 src/include/nodes/plannodes.h           |   1 +
 17 files changed, 909 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fbe6929..ef4acc7 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+								  PendingAsyncRequest *areq,
+								  bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -4442,6 +4457,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = postgresIterateForeignScan(node);
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+								  bool reinit)
+{
+	elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	elog(ERROR, "postgresForeignAsyncNotify");
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest.  Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 2587ef7..9fcc4e4 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -464,11 +464,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designed async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		i = estate->es_num_pending_async;
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (i >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[i] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[i] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[i];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async++;
+
+	/* Initialize the new request. */
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Check for events, but don't block if there are notifications that
+		 * have not been delivered yet.
+		 */
+		if (estate->es_async_callback_pending > 0)
+			ExecAsyncEventWait(estate, 0);
+		else if (!ExecAsyncEventWait(estate, cur_timeout))
+			cur_timeout = 0;			/* Timeout was reached. */
+		else
+		{
+			instr_time      cur_time;
+
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout < 0)
+				cur_timeout = 0;
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			/* Skip it if no callback is pending. */
+			if (!areq->callback_pending)
+				continue;
+
+			/*
+			 * Mark it as no longer needing a callback.  We must do this
+			 * before dispatching the callback in case the callback resets
+			 * the flag.
+			 */
+			areq->callback_pending = false;
+			estate->es_async_callback_pending--;
+
+			/* Perform the actual callback; set requestor_done if appropriate. */
+			if (!areq->request_complete)
+				ExecAsyncNotify(estate, areq);
+			else
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+			}
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all non-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				if (!tail->callback_pending && tail->request_complete)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; to the contrary,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+		estate->es_wait_event_set =
+			CreateWaitEventSet(estate->es_query_cxt,
+							   estate->es_allocated_fd_events + 1);
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0)
+			ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE);
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			if (!areq->callback_pending)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+				estate->es_async_callback_pending++;
+			}
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+			}
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register.  force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->request_complete = true;
+
+	/* Make sure this request is flagged for a callback. */
+	if (!areq->callback_pending)
+	{
+		areq->callback_pending = true;
+		estate->es_async_callback_pending++;
+	}
+}
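
To make the intended handshake concrete, here is a minimal requestee-side sketch of how an async-capable FDW might drive the entry points above.  This is illustration only, not part of the patch: ExampleFdwState and the example_* helpers are hypothetical stand-ins for an FDW's private scan state and its libpq plumbing (the usual fdwapi.h, execAsync.h, and libpq-fe.h headers are assumed).

    /* Hypothetical FDW state; conn is assumed to hold a libpq connection. */
    typedef struct ExampleFdwState
    {
        PGconn     *conn;
    } ExampleFdwState;

    /* Called via ExecAsyncRequest(): answer now if we can, else go async. */
    static void
    exampleForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
    {
        ForeignScanState *node = (ForeignScanState *) areq->requestee;
        ExampleFdwState *fdw = (ExampleFdwState *) node->fdw_state;

        if (example_tuple_ready(fdw))       /* hypothetical helper */
        {
            /* A tuple is already buffered: complete the request at once. */
            ExecAsyncRequestDone(estate, areq,
                                 (Node *) example_fetch_slot(node));
            return;
        }

        /* Otherwise we need one socket event and don't care about MyLatch. */
        ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
    }

    /* Called before each wait, per the contract described above. */
    static void
    exampleForeignAsyncConfigureWait(EState *estate,
                                     PendingAsyncRequest *areq, bool reinit)
    {
        ForeignScanState *node = (ForeignScanState *) areq->requestee;
        ExampleFdwState *fdw = (ExampleFdwState *) node->fdw_state;

        /* user_data must point back at areq so the event loop can find us. */
        AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
                          PQsocket(fdw->conn), NULL, areq);
    }

    /* Called when our socket event fires. */
    static void
    exampleForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
    {
        ForeignScanState *node = (ForeignScanState *) areq->requestee;
        ExampleFdwState *fdw = (ExampleFdwState *) node->fdw_state;

        if (example_consume_input(fdw))     /* hypothetical helper */
            ExecAsyncRequestDone(estate, areq,
                                 (Node *) example_fetch_slot(node));
        /* else stay pending; ConfigureWait runs again before the next wait */
    }

(The immediate-completion branch in exampleForeignAsyncRequest mirrors what the not-actually-async postgres_fdw stub in this patch series does: it produces the tuple synchronously and reports it via ExecAsyncRequestDone.)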
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..bb06569 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine only sets up the subplans that are scanned synchronously,
+	 * so the first node we can scan is given by as_nasyncplans and the last
+	 * is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, every asynchronous subplan needs a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * If there are any nodes that need a new asynchronous request,
+		 * make all of them.
+		 */
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			node->as_nasyncpending++;
+		}
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			for (;;)
+			{
+				if (node->as_nasyncpending == 0)
+				{
+					/*
+					 * If there is no asynchronous activity still pending
+					 * and the synchronous activity is also complete, we're
+					 * totally done scanning this node.  Otherwise, we're
+					 * done with the asynchronous stuff but must continue
+					 * scanning the synchronous children.
+					 */
+					if (node->as_syncdone)
+						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+					break;
+				}
+				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+				{
+					/* Timeout reached. */
+					break;
+				}
+				if (node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
+		 */
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->request_complete);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* Request is no longer pending. */
+	Assert(node->as_nasyncpending > 0);
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately; the next call to ExecAppend
+	 * will issue requests for everything in as_needrequest.  Note that
+	 * bms_add_member may repalloc the set, so its result must be stored back.
+	 */
+	node->as_needrequest = bms_add_member(node->as_needrequest,
+										  areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..85d436f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -355,3 +355,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index d973225..a4b31cc 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -218,6 +218,7 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 7258c03..c59c635 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -359,6 +359,7 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d608530..8051c58 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1521,6 +1521,7 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index ad49674..7caa8d3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *appendplans, int nasyncplans, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -270,6 +270,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -955,8 +956,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
 	}
 
 	/*
@@ -1001,7 +1011,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4941,7 +4951,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4951,6 +4961,7 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
 
 	return node;
 }
@@ -6225,3 +6236,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+				break;
+			}
+		default:
+			break;
+	}
+	return false;
+}
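
On the FDW side, the hook that feeds is_async_capable_path() can start out trivially.  A hedged sketch (illustrative only; a real FDW would likely inspect the path before claiming capability):

    /* Illustration only: claim async capability for every foreign path. */
    static bool
    exampleIsForeignPathAsyncCapable(ForeignPath *path)
    {
        return true;
    }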
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+
+#endif   /* EXECASYNC_H */
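
Seen from the requestor side, these exports pair up as "issue a request, then pump the event loop".  A condensed sketch of the pattern nodeAppend.c uses, for a hypothetical single-child requestor (MyState, child, and my_result are made up; a real node type would also need a case added to ExecAsyncResponse's dispatch switch):

    typedef struct MyState
    {
        PlanState   ps;             /* first field, as in every planstate */
        PlanState  *child;          /* async-capable requestee */
        TupleTableSlot *my_result;  /* stashed by our response callback */
    } MyState;

    static TupleTableSlot *
    example_get_tuple_async(MyState *node)
    {
        EState     *estate = node->ps.state;

        /* request_index 0: scratch value echoed back to our callback */
        ExecAsyncRequest(estate, &node->ps, 0, node->child);

        /* timeout -1: block until our own request has completed */
        if (!ExecAsyncEventLoop(estate, &node->ps, -1))
            elog(ERROR, "async event loop timed out unexpectedly");

        return node->my_result;
    }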
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..81a079d 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3e69ab0 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..88feb9a 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
 } FdwRoutine;
 
 
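For completeness, an FDW handler would publish the new members alongside the existing ones; continuing the hypothetical example callbacks sketched earlier:

    Datum
    example_fdw_handler(PG_FUNCTION_ARGS)
    {
        FdwRoutine *routine = makeNode(FdwRoutine);

        /* ... mandatory scan callbacks (GetForeignRelSize etc.) go here ... */

        /* Optional asynchronous-execution support. */
        routine->IsForeignPathAsyncCapable = exampleIsForeignPathAsyncCapable;
        routine->ForeignAsyncRequest = exampleForeignAsyncRequest;
        routine->ForeignAsyncConfigureWait = exampleForeignAsyncConfigureWait;
        routine->ForeignAsyncNotify = exampleForeignAsyncNotify;

        PG_RETURN_POINTER(routine);
    }
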
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5c3b868..7b0e145 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,25 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	bool		callback_pending;			/* Callback is needed. */
+	bool		request_complete;		/* Request complete, result valid. */
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -430,6 +449,31 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area   *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_async_callback_pending is the number of PendingAsyncRequests for
+	 * which callback_pending is true.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;
+	int			es_max_pending_async;
+	int			es_async_callback_pending;
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1165,17 +1209,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index e2fbc7d..327119b 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -208,6 +208,7 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
 } Append;
 
 /* ----------------
-- 
2.9.2

0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From f63728704995dd9b147a2f94778e1c1ad05da517 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/7] Fix some bugs.

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
 contrib/postgres_fdw/postgres_fdw.c            |   3 +-
 src/backend/executor/execAsync.c               |   4 +-
 src/backend/postmaster/pgstat.c                |   3 +
 src/include/pgstat.h                           |   3 +-
 5 files changed, 81 insertions(+), 74 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 785f520..457cfdb 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- a        | aaa
- a        | aaaa
- a        | aaaaa
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | aaaa
+ a        | aaaaa
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | new
  b        | new
  b        | new
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | newtoo
- a        | newtoo
- a        | newtoo
  b        | newtoo
  b        | newtoo
  b        | newtoo
+ a        | newtoo
+ a        | newtoo
+ a        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6321,120 +6321,120 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                         QUERY PLAN                                          
----------------------------------------------------------------------------------------------
+                                               QUERY PLAN                                                
+---------------------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar.f1 = foo2.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6462,26 +6462,26 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
-         Hash Cond: (foo.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+         Hash Cond: (foo2.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
-         Merge Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+         Merge Cond: (bar2.f1 = foo2.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6489,19 +6489,19 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo.f1)), foo.f1
-               Sort Key: foo.f1
+               Output: (ROW(foo2.f1)), foo2.f1
+               Sort Key: foo2.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6668,8 +6668,8 @@ update bar set f2 = f2 + 100 returning *;
 update bar set f2 = f2 + 100 returning *;
  f1 | f2  
 ----+-----
-  1 | 311
   2 | 322
+  1 | 311
   6 | 266
   3 | 333
   4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index ef4acc7..c64ae41 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -4474,7 +4475,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
-	slot = postgresIterateForeignScan(node);
+	slot = ExecForeignScan(node);
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
 #include "executor/nodeAppend.h"
 #include "executor/nodeForeignscan.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
-								 occurred_event, EVENT_BUFFER_SIZE);
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
 	if (noccurred == 0)
 		return false;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 61e6a2c..beae80b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3392,6 +3392,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 282f8ae..a42ad48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -785,7 +785,8 @@ typedef enum
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From a951a6124c00297d825323199a30c8c570ca46b4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/7] Modify async execution infrastructure.

---
 contrib/postgres_fdw/expected/postgres_fdw.out |  68 ++++++++--------
 contrib/postgres_fdw/postgres_fdw.c            |   5 +-
 src/backend/executor/execAsync.c               | 105 ++++++++++++++-----------
 src/backend/executor/nodeAppend.c              |  50 ++++++------
 src/backend/executor/nodeForeignscan.c         |   4 +-
 src/backend/nodes/copyfuncs.c                  |   1 +
 src/backend/nodes/outfuncs.c                   |   1 +
 src/backend/nodes/readfuncs.c                  |   1 +
 src/backend/optimizer/plan/createplan.c        |  24 +++++-
 src/backend/utils/adt/ruleutils.c              |   6 +-
 src/include/executor/nodeForeignscan.h         |   2 +-
 src/include/foreign/fdwapi.h                   |   2 +-
 src/include/nodes/execnodes.h                  |  10 ++-
 src/include/nodes/plannodes.h                  |   1 +
 14 files changed, 167 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 457cfdb..083d947 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6321,13 +6321,13 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6335,10 +6335,10 @@ select * from bar where f1 in (select f1 from foo) for update;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6358,13 +6358,13 @@ select * from bar where f1 in (select f1 from foo) for update;
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6372,10 +6372,10 @@ select * from bar where f1 in (select f1 from foo) for share;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6396,22 +6396,22 @@ select * from bar where f1 in (select f1 from foo) for share;
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                               QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
+                                         QUERY PLAN                                          
+---------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar.f1 = foo2.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6419,16 +6419,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar2.f1 = foo.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6462,8 +6462,8 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
-         Hash Cond: (foo2.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+         Hash Cond: (foo.f1 = bar.f1)
          ->  Append
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
@@ -6480,8 +6480,8 @@ where bar.f1 = ss.f1;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
-         Merge Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+         Merge Cond: (bar2.f1 = foo.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6489,8 +6489,8 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo2.f1)), foo2.f1
-               Sort Key: foo2.f1
+               Output: (ROW(foo.f1)), foo.f1
+               Sort Key: foo.f1
                ->  Append
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index c64ae41..b92b279 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
 								  PendingAsyncRequest *areq,
 								  bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
@@ -4479,11 +4479,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
-static void
+static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 								  bool reinit)
 {
 	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	return false;
 }
 
 static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 	bool reinit);
 static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
 static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 				 PlanState *requestee)
 {
 	PendingAsyncRequest *areq = NULL;
-	int		i = estate->es_num_pending_async;
+	int		nasync = estate->es_num_pending_async;
 
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * We start with 16 slots, and thereafter double the array size each
 	 * time we run out of slots.
 	 */
-	if (i >= estate->es_max_pending_async)
+	if (nasync >= estate->es_max_pending_async)
 	{
 		int	newmax;
 
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
 	 * one.
 	 */
-	if (estate->es_pending_async[i] == NULL)
+	if (estate->es_pending_async[nasync] == NULL)
 	{
 		areq = MemoryContextAllocZero(estate->es_query_cxt,
 									  sizeof(PendingAsyncRequest));
-		estate->es_pending_async[i] = areq;
+		estate->es_pending_async[nasync] = areq;
 	}
 	else
 	{
-		areq = estate->es_pending_async[i];
+		areq = estate->es_pending_async[nasync];
 		MemSet(areq, 0, sizeof(PendingAsyncRequest));
 	}
-	areq->myindex = estate->es_num_pending_async++;
+	areq->myindex = estate->es_num_pending_async;
 
 	/* Initialize the new request. */
 	areq->requestor = requestor;
 	areq->request_index = request_index;
 	areq->requestee = requestee;
 
-	/* Give the requestee a chance to do whatever it wants. */
+	/*
+	 * Give the requestee a chance to do whatever it wants.  Request
+	 * functions mark the request ASYNC_COMPLETE if a result is available.
+	 */
 	switch (nodeTag(requestee))
 	{
 		case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 			elog(ERROR, "unrecognized node type: %d",
 				(int) nodeTag(requestee));
 	}
+
+	/*
+	 * If a result is available, complete it immediately.
+	 */
+	if (areq->state == ASYNC_COMPLETE)
+	{
+		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+		ExecAsyncResponse(estate, areq);
+
+		return;
+	}
+
+	/* No result available now, make this node pending */
+	estate->es_num_pending_async++;
 }
 
 /*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
-			/* Skip it if no callback is pending. */
-			if (!areq->callback_pending)
-				continue;
-
-			/*
-			 * Mark it as no longer needing a callback.  We must do this
-			 * before dispatching the callback in case the callback resets
-			 * the flag.
-			 */
-			areq->callback_pending = false;
-			estate->es_async_callback_pending--;
-
-			/* Perform the actual callback; set request_done if appropraite. */
-			if (!areq->request_complete)
+			/* Dispatch the callback if one is pending. */
+			if (areq->state == ASYNC_CALLBACK_PENDING)
+			{
+				/*
+				 * Decrement the pending-callback count before dispatching
+				 * the callback, since the callback may re-arm the request
+				 * and cause it to be counted again.
+				 */
+				estate->es_async_callback_pending--;
 				ExecAsyncNotify(estate, areq);
-			else
+			}
+
+			if (areq->state == ASYNC_COMPLETE)
 			{
 				any_node_done = true;
 				if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				PendingAsyncRequest *head;
 				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
 
-				if (!tail->callback_pending && tail->request_complete)
+				if (tail->state == ASYNC_COMPLETE)
 					continue;
 				head = estate->es_pending_async[hidx];
 				estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
  * means wait forever, 0 means don't wait at all, and >0 means wait for the
  * indicated number of milliseconds.
  *
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was no event to wait for, as when a request completes during registration.
  */
 static bool
 ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
 	int		n;
 	bool	reinit = false;
 	bool	process_latch_set = false;
+	bool	added = false;
 
 	if (estate->es_wait_event_set == NULL)
 	{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		PendingAsyncRequest *areq = estate->es_pending_async[i];
 
 		if (areq->num_fd_events > 0)
-			ExecAsyncConfigureWait(estate, areq, reinit);
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
 	}
 
+	Assert(added);
+
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
 								 occurred_event, EVENT_BUFFER_SIZE,
 								 WAIT_EVENT_ASYNC_WAIT);
+
 	if (noccurred == 0)
 		return false;
 
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		{
 			PendingAsyncRequest *areq = w->user_data;
 
-			if (!areq->callback_pending)
-			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
-				estate->es_async_callback_pending++;
-			}
+			Assert(areq->state == ASYNC_WAITING);
+
+			areq->state = ASYNC_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
 		}
 	}
 
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 			if (areq->wants_process_latch)
 			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
+				Assert(areq->state == ASYNC_WAITING);
+				areq->state = ASYNC_CALLBACK_PENDING;
 			}
 		}
 	}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
  * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
  * and the number of calls should not exceed areq->num_fd_events (as
  * prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * An individual request may omit registering an event, but it is the
+ * node driver's responsibility to register at least one event per
+ * requestor.
  */
-static void
+static bool
 ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 					   bool reinit)
 {
 	switch (nodeTag(areq->requestee))
 	{
 		case T_ForeignScanState:
-			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
 			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
 	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
 	areq->num_fd_events = num_fd_events;
 	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNC_WAITING;
 
 	if (force_reset && estate->es_wait_event_set != NULL)
 	{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
 	 * need a callback to remove registered wait events.  It's not clear
 	 * that we would come out ahead, so use brute force for now.
 	 */
+	Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
 	if (areq->num_fd_events > 0 || areq->wants_process_latch)
 		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
 
 	/* Save result and mark request as complete. */
 	areq->result = result;
-	areq->request_complete = true;
-
-	/* Make sure this request is flagged for a callback. */
-	if (!areq->callback_pending)
-	{
-		areq->callback_pending = true;
-		estate->es_async_callback_pending++;
-	}
+	areq->state = ASYNC_COMPLETE;
 }
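
To recap the intent of the execAsync.c changes in one place: a PendingAsyncRequest
now moves through a small state machine instead of juggling the callback_pending
and request_complete flags.  Here is a reader's sketch of the transitions implied
by the code above; the checker function itself is illustrative, not part of the
patch:

/*
 * Sketch of the PendingAsyncRequest lifecycle; illustrative only.
 */
static bool
async_transition_is_valid(AsyncRequestState from, AsyncRequestState to)
{
	switch (from)
	{
		case ASYNC_IDLE:
			/* ExecAsyncRequest: completed at once, or armed for waiting
			 * via ExecAsyncSetRequiredEvents */
			return to == ASYNC_COMPLETE || to == ASYNC_WAITING;
		case ASYNC_WAITING:
			/* ExecAsyncEventWait: a registered wait event fired */
			return to == ASYNC_CALLBACK_PENDING;
		case ASYNC_CALLBACK_PENDING:
			/* ExecAsyncNotify: produced a result, or re-armed to wait */
			return to == ASYNC_COMPLETE || to == ASYNC_WAITING;
		case ASYNC_COMPLETE:
			/* the slot is recycled for a fresh request */
			return to == ASYNC_IDLE;
	}
	return false;
}
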
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bb06569..c234f1f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
 		 */
 		while ((i = bms_first_member(node->as_needrequest)) >= 0)
 		{
-			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
 			node->as_nasyncpending++;
+
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			/* If this request immediately gives a result, take it. */
+			if (node->as_nasyncresult > 0)
+				return node->as_asyncresult[--node->as_nasyncresult];
 		}
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 	}
 
 	for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-			for (;;)
+			while (node->as_nasyncpending > 0)
 			{
-				if (node->as_nasyncpending == 0)
-				{
-					/*
-					 * If there is no asynchronous activity still pending
-					 * and the synchronous activity is also complete, we're
-					 * totally done scanning this node.  Otherwise, we're
-					 * done with the asynchronous stuff but must continue
-					 * scanning the synchronous children.
-					 */
-					if (node->as_syncdone)
-						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
-					break;
-				}
-				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
-				{
-					/* Timeout reached. */
-					break;
-				}
-				if (node->as_nasyncresult > 0)
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
 				{
 					/* Asynchronous subplan returned a tuple! */
 					--node->as_nasyncresult;
 					return node->as_asyncresult[node->as_nasyncresult];
 				}
+
+				/* Timeout reached.  Fall through to the sync nodes, if any. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 			}
 		}
 
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	/* We shouldn't be called until the request is complete. */
-	Assert(areq->request_complete);
+	Assert(areq->state == ASYNC_COMPLETE);
 
 	/* Our result slot shouldn't already be occupied. */
 	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
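
Condensing the nodeAppend.c changes, the requestor-side flow is roughly the
following.  This is a reader's sketch using the field names from the hunks
above; error handling and the sync-scan fallthrough are omitted, and the
function itself is not part of the patch:

/* Sketch of ExecAppend's async control flow; illustrative only. */
static TupleTableSlot *
append_async_flow(AppendState *node)
{
	EState	   *estate = node->ps.state;
	int			i;

	/* Issue one request per runnable async child; some may finish at once. */
	while ((i = bms_first_member(node->as_needrequest)) >= 0)
	{
		node->as_nasyncpending++;
		ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
		if (node->as_nasyncresult > 0)
			return node->as_asyncresult[--node->as_nasyncresult];
	}

	/*
	 * Poll (timeout 0) while sync children remain; block (timeout -1) once
	 * they are exhausted.  On timeout, fall back to the sync children.
	 */
	while (node->as_nasyncpending > 0)
	{
		long	timeout = node->as_syncdone ? -1 : 0;

		if (ExecAsyncEventLoop(estate, &node->ps, timeout) &&
			node->as_nasyncresult > 0)
			return node->as_asyncresult[--node->as_nasyncresult];
		if (!node->as_syncdone)
			break;
	}

	return NULL;				/* caller continues with the sync children */
}
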
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 85d436f..d3567bb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -378,7 +378,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
  *		In async mode, configure for a wait
  * ----------------------------------------------------------------
  */
-void
+bool
 ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit)
 {
@@ -386,7 +386,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
 	FdwRoutine *fdwroutine = node->fdwroutine;
 
 	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
-	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a4b31cc..ad64649 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -219,6 +219,7 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(appendplans);
 	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index c59c635..829e826 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -360,6 +360,7 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(appendplans);
 	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 8051c58..7f72c99 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1522,6 +1522,7 @@ _readAppend(void)
 
 	READ_NODE_FIELD(appendplans);
 	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7caa8d3..ff1d663 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -193,7 +193,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+						   int referent, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -960,6 +961,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	List	   *syncplans = NIL;
 	ListCell   *subpaths;
 	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -985,7 +988,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child of an inheritance set is the representative used when
+	 * explaining tlist entries (see set_deparse_planstate), so we must track
+	 * where the first child of best_path->subpaths lands in the reordered
+	 * subplan list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -999,9 +1009,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		{
 			asyncplans = lappend(asyncplans, subplan);
 			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
 		}
 		else
 			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1011,7 +1025,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -4951,7 +4966,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -4962,6 +4977,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
 	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
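
Concretely, for the referent bookkeeping above: given subpaths (sync_A,
async_B, async_C), the reordered list handed to make_append is (async_B,
async_C, sync_A) with nasyncplans = 2, and referent is set to 2 so that it
still indexes sync_A, the original first child.  When the first subpath is
itself async, referent stays 0 and no adjustment is needed.
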
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 4e2ba19..47135fe 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4241,7 +4241,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int		idx = ((Append *) ps->plan)->referent;
+
+		dpns->outer_planstate = ((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3e69ab0..47a3920 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 
 extern void ExecAsyncForeignScanRequest(EState *estate,
 	PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit);
 extern void ExecAsyncForeignScanNotify(EState *estate,
 	PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 88feb9a..65517fd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
 typedef void (*ForeignAsyncRequest_function) (EState *estate,
 											PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											PendingAsyncRequest *areq,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7b0e145..139bd8e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -357,6 +357,13 @@ typedef struct ResultRelInfo
  * State for an asynchronous tuple request.
  * ----------------
  */
+typedef enum AsyncRequestState
+{
+	ASYNC_IDLE,
+	ASYNC_WAITING,
+	ASYNC_CALLBACK_PENDING,
+	ASYNC_COMPLETE
+} AsyncRequestState;
 typedef struct PendingAsyncRequest
 {
 	int			myindex;			/* Index in es_pending_async. */
@@ -365,8 +372,7 @@ typedef struct PendingAsyncRequest
 	int			request_index;	/* Scratch space for requestor. */
 	int			num_fd_events;	/* Max number of FD events requestee needs. */
 	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
-	bool		callback_pending;			/* Callback is needed. */
-	bool		request_complete;		/* Request complete, result valid. */
+	AsyncRequestState state;
 	Node	   *result;			/* Result (NULL if no more tuples). */
 } PendingAsyncRequest;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 327119b..1df6693 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -209,6 +209,7 @@ typedef struct Append
 	Plan		plan;
 	List	   *appendplans;
 	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
-- 
2.9.2

0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 20a519a37bb2667427d1c857466bd220d9fe0bf9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/7] Make postgres_fdw async-capable

---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  64 ++--
 contrib/postgres_fdw/postgres_fdw.c            | 483 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 src/backend/executor/execProcnode.c            |   9 +
 src/include/foreign/fdwapi.h                   |   2 +
 7 files changed, 510 insertions(+), 133 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index bcdddc2..ebc9417 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocates
+ * initsize bytes of zeroed storage if none exists yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
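
For clarity, a hypothetical caller of the new connection.c entry point might
look like this; MyConnState and my_begin are invented names, and only
GetConnection and GetConnectionSpecificStorage come from the patch:

/* Hypothetical use of per-connection storage from an FDW callback. */
typedef struct MyConnState
{
	ForeignScanState *current_owner;	/* node with a query in flight */
} MyConnState;

static void
my_begin(UserMapping *user)
{
	PGconn	   *conn = GetConnection(user, false);
	MyConnState *cstate;

	/*
	 * The same chunk is returned for every caller on this user mapping;
	 * it is zero-filled on first allocation, so current_owner starts NULL.
	 */
	cstate = (MyConnState *) GetConnectionSpecificStorage(user,
														  sizeof(MyConnState));
	if (cstate->current_owner == NULL)
	{
		/* no query in flight; this node may send one and take ownership */
	}
	(void) conn;
}
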
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 083d947..15519c1 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6173,12 +6173,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | aaaa
  a        | aaaaa
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6201,12 +6201,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6229,12 +6229,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | new
- b        | new
- b        | new
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | new
+ b        | new
+ b        | new
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6257,12 +6257,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | newtoo
- b        | newtoo
- b        | newtoo
  a        | newtoo
  a        | newtoo
  a        | newtoo
+ b        | newtoo
+ b        | newtoo
+ b        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6350,9 +6350,9 @@ select * from bar where f1 in (select f1 from foo) for update;
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6387,9 +6387,9 @@ select * from bar where f1 in (select f1 from foo) for share;
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6652,27 +6652,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
-  2 | 322
   1 | 311
-  6 | 266
+  2 | 322
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index b92b279..21e7fd9 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* Last node in the waiting list;
+									 * maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
 static bool postgresForeignAsyncConfigureWait(EState *estate,
-								  PendingAsyncRequest *areq,
-								  bool reinit);
+						    PendingAsyncRequest *areq,
+						    bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
 						   PendingAsyncRequest *areq);
 
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1337,12 +1371,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1398,32 +1441,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If other nodes are waiting for this node's connection, let the
+			 * first waiter become the connection's next owner.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Some other node holds this connection.  Add myself to the tail
+			 * of the waiters' list and return not-ready.  To avoid scanning
+			 * through the waiters' list, the current owner maintains a
+			 * shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1439,7 +1576,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1447,6 +1584,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1475,9 +1615,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1495,7 +1635,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1503,16 +1643,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Clean up async state and absorb any remaining result on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1714,7 +1870,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1793,6 +1951,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1803,14 +1963,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1818,10 +1978,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1859,6 +2019,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1879,14 +2041,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1894,10 +2056,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1935,6 +2097,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1955,14 +2119,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1970,10 +2134,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2020,16 +2184,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2309,7 +2473,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2362,7 +2528,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2409,8 +2578,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2529,6 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2572,6 +2742,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2926,11 +3106,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2996,47 +3176,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Receive the rows already requested on the node's cursor and store them.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while (fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3046,27 +3275,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate the connection so that this node can send its next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while (PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connspec->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3150,7 +3434,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3160,12 +3444,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3173,9 +3457,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3306,9 +3590,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3316,10 +3600,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4465,8 +4749,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
 }
 
 /*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request.  Notify the caller if the next tuple is
+ * immediately available.  ExecForeignScan does additional work to finish
+ * the returned tuple, so call it instead of postgresIterateForeignScan to
+ * acquire a tuple in the expected shape.
  */
 static void
 postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4475,22 +4761,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
 	slot = ExecForeignScan(node);
-	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
 }
 
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when this node is the connection's owner.
+ * Otherwise another node on this connection owns it and waits instead.
+ */
 static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
-								  bool reinit)
+						   bool reinit)
 {
-	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
 	return false;
 }
 
+/*
+ * Process a notification from the async mechanism.  ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
 static void
 postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
 {
-	elog(ERROR, "postgresForeignAsyncNotify");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
 /*
@@ -4850,7 +5173,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f8c255e..1800977 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index f48743c..7153661 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1552,8 +1552,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..f864abe 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -114,6 +114,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -806,6 +807,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+		{
+			ForeignScanState *fsstate = (ForeignScanState *)node;
+			FdwRoutine *fdwroutine = fsstate->fdwroutine;
+			if (fdwroutine->ShutdownForeignScan)
+				fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+		}
+		break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 65517fd..e40db0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
 											PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
 	ForeignAsyncRequest_function ForeignAsyncRequest;
 	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
 	ForeignAsyncNotify_function ForeignAsyncNotify;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
2.9.2

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch (text/x-patch; charset=us-ascii)
From 9ad5ab969809960a5d954aed086743e04a963e2e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/7] Use resource owner to prevent wait event set from leaking

Wait event sets created for async execution can live across several
iterations, so they leak if an error occurs during those iterations.
This commit uses a resource owner to prevent such leaks.
---
 src/backend/executor/execAsync.c      | 28 ++++++++++++++--
 src/backend/storage/ipc/latch.c       | 19 ++++++++++-
 src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
 src/include/utils/resowner_private.h  |  8 +++++
 4 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/latch.h"
+#include "utils/resowner_private.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
 static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	if (estate->es_wait_event_set == NULL)
 	{
+		ResourceOwner savedOwner;
+
 		/*
 		 * Allow for a few extra events without reinitializing.  It
 		 * doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		 * of external FDs are likely to run afoul of kernel limits anyway.
 		 */
 		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
-		estate->es_wait_event_set =
-			CreateWaitEventSet(estate->es_query_cxt,
-							   estate->es_allocated_fd_events + 1);
+
+		/*
+		 * The wait event set created here should be released in case of
+		 * error.
+		 */
+		savedOwner = CurrentResourceOwner;
+		CurrentResourceOwner = TopTransactionResourceOwner;
+
+		PG_TRY();
+		{
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_allocated_fd_events + 1);
+		}
+		PG_CATCH();
+		{
+			CurrentResourceOwner = savedOwner;
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		CurrentResourceOwner = savedOwner;
 		AddWaitEventToSet(estate->es_wait_event_set,
 						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 		reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index b7e5129..90a93cc 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set;
+	ResourceOwner savedOwner = CurrentResourceOwner;
+
+	/* This function doesn't need resowner for event set */
+	CurrentResourceOwner = NULL;
+	set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	CurrentResourceOwner = savedOwner;
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	char	   *data;
 	Size		sz = 0;
 
+	if (CurrentResourceOwner)
+		ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	set->resowner = CurrentResourceOwner;
+	if (CurrentResourceOwner)
+		ResourceOwnerRememberWES(set->resowner, set);
 	return set;
 }
 
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index cdc460b..46c2531 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index fd32090..6087257e7 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From 5c39b606ada4ed4c84d4aea283ada6f19a90913a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/7] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by misprediction penalties on the
branches related to async execution. Apply unlikely() to them to avoid
such penalties on the existing synchronous route. Asynchronous
execution already involves a lot of additional code, so this doesn't
add significant degradation there.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index c234f1f..e82547d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

0007-Add-instrumentation-to-async-execution.patch (text/x-patch; charset=us-ascii)
From 252216d9348e6d32894d15732f6991a3c770baf3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution

Make EXPLAIN ANALYZE give sane results when async execution has taken
place.
---
 src/backend/executor/execAsync.c  | 19 +++++++++++++++++++
 src/backend/executor/instrument.c |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	PendingAsyncRequest *areq = NULL;
 	int		nasync = estate->es_num_pending_async;
 
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
 	 * available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	if (areq->state == ASYNC_COMPLETE)
 	{
 		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
 		ExecAsyncResponse(estate, areq);
+		if (areq->requestee->instrument)
+			InstrStopNode(requestee->instrument,
+						  TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
 
 		return;
 	}
 
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
 	/* No result available now, make this node pending */
 	estate->es_num_pending_async++;
 }
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
 			/* Skip it if not pending. */
 			if (areq->state == ASYNC_CALLBACK_PENDING)
 			{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				if (requestor == areq->requestor)
 					requestor_done = true;
 				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
 			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
 		}
 
 		/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2614bf4..6a22a15 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
-- 
2.9.2

#16Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#15)
7 attachment(s)

I noticed that this patch set conflicts with 665d1fa (logical
replication), so I rebased it. Only executor/Makefile conflicted.

At Mon, 31 Oct 2016 10:39:12 +0900 (JST), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20161031.103912.217430542.horiguchi.kyotaro@lab.ntt.co.jp>

This is a PoC patch set for the asynchronous execution feature, based
on the executor infrastructure Robert proposed. These patches are
rebased on current master.

0001-robert-s-2nd-framework.patch

Robert's executor async infrastructure. Async-driving nodes
register their async-capable children, and synchronization and data
transfer are done out of band of the ordinary ExecProcNode channel,
so async execution no longer disturbs async-unaware nodes or slows
them down.
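
To make the control flow concrete, here is a minimal sketch of the
requestor-side pattern this infrastructure establishes, loosely
modeled on what ExecAppend does in 0001. It is illustrative only:
"ExampleState" and its fields are hypothetical, while ExecAsyncRequest
and ExecAsyncEventLoop are the real entry points from execAsync.c.

    /* Sketch: request tuples from async children, then run the loop. */
    static TupleTableSlot *
    ExampleFetchAsync(ExampleState *node)
    {
        EState *estate = node->ps.state;
        int     i;

        /* Issue one request per child that currently needs one. */
        while ((i = bms_first_member(node->needrequest)) >= 0)
            ExecAsyncRequest(estate, &node->ps, i, node->children[i]);

        /*
         * Timeout semantics: -1 blocks until a result is delivered to
         * this requestor, 0 polls without blocking, >0 waits at most
         * that many milliseconds.
         */
        if (ExecAsyncEventLoop(estate, &node->ps, -1))
            return node->saved_result;  /* stashed by our Response hook */

        return NULL;    /* cannot happen with a timeout of -1 */
    }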

0002-Fix-some-bugs.patch

Some fixes needed to make 0001 work. This is kept separate just to
preserve the shape of the 0001 patch.

0003-Modify-async-execution-infrastructure.patch

The original infrastructure doesn't work when multiple foreign
tables are on the same connection. This patch makes that case work.
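
The crux, as far as I can tell from the 0004 hunks, is that each
connection tracks a single owner node at a time, and a scan that needs
a connection first drains whatever the previous owner left in flight.
Roughly (a sketch only; "connspec" and "current_owner" follow the
patch, but the function itself is illustrative):

    /* Sketch: vacate a shared connection before issuing a new query. */
    static void
    example_vacate_connection(PgFdwScanState *fsstate)
    {
        ForeignScanState *owner = fsstate->s.connspec->current_owner;

        if (owner != NULL)
        {
            PGconn *conn = GetPgFdwScanState(owner)->s.conn;

            /* Absorb any results still pending for the old owner. */
            while (PQisBusy(conn))
                PQclear(PQgetResult(conn));

            fsstate->s.connspec->current_owner = NULL;
        }
    }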

0004-Make-postgres_fdw-async-capable.patch

Makes postgres_fdw work asynchronously.
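
For reference, an FDW opts in by filling the new FdwRoutine fields in
its handler, as postgres_fdw does in this patch set; schematically
(the "example*" names are placeholders):

    Datum
    example_fdw_handler(PG_FUNCTION_ARGS)
    {
        FdwRoutine *routine = makeNode(FdwRoutine);

        /* ... the usual scan and modify callbacks ... */

        /* Async-execution callbacks introduced by these patches. */
        routine->IsForeignPathAsyncCapable = exampleIsForeignPathAsyncCapable;
        routine->ForeignAsyncRequest = exampleForeignAsyncRequest;
        routine->ForeignAsyncConfigureWait = exampleForeignAsyncConfigureWait;
        routine->ForeignAsyncNotify = exampleForeignAsyncNotify;
        routine->ShutdownForeignScan = exampleShutdownForeignScan;

        PG_RETURN_POINTER(routine);
    }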

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch

This addresses a problem Robert pointed out about the 0001 patch:
a WaitEventSet used for async execution can leak on error.
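
The pattern is to create the WaitEventSet under a resource owner that
survives until transaction end, so an error during the event loop
releases the set; condensed from the execAsync.c hunk in 0005 (a
sketch, not new API):

    ResourceOwner saved = CurrentResourceOwner;
    WaitEventSet *set;

    /* CreateWaitEventSet registers the set with CurrentResourceOwner. */
    CurrentResourceOwner = TopTransactionResourceOwner;
    PG_TRY();
    {
        set = CreateWaitEventSet(estate->es_query_cxt, nevents);
    }
    PG_CATCH();
    {
        CurrentResourceOwner = saved;
        PG_RE_THROW();
    }
    PG_END_TRY();
    CurrentResourceOwner = saved;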

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch

ExecAppend gets a bit slower due to branch misprediction penalties.
This fixes that by using the unlikely() macro.
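
(unlikely() is assumed here to be the conventional __builtin_expect
wrapper, something like the sketch below; it moves the async branch
out of the hot synchronous path:)

    /* Assumed conventional definition, not quoted from the tree. */
    #define unlikely(x) __builtin_expect((x) != 0, 0)

    if (unlikely(node->as_nasyncplans > 0))
    {
        /* async bookkeeping; cold for a plain synchronous Append */
    }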

0007-Add-instrumentation-to-async-execution.patch

As described above for 0001, the async infrastructure conveys tuples
outside the ExecProcNode channel, so EXPLAIN ANALYZE requires special
treatment to show sane results. This patch attempts that.
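
The treatment is to bracket the out-of-band request/notify calls with
the usual instrumentation pair, counting a tuple only when one is
actually delivered, together with the instrument.c tweak so that a
zero-tuple stop doesn't start the first-tuple clock. Schematically
(paraphrasing the 0007 hunks; "tuple_delivered" is shorthand, not a
real field):

    if (areq->requestee->instrument)
        InstrStartNode(areq->requestee->instrument);

    /* ... dispatch the async callback for this request ... */

    if (areq->requestee->instrument)
        InstrStopNode(areq->requestee->instrument,
                      tuple_delivered ? 1.0 : 0.0);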

A result of a performance measurement is in this message.

/messages/by-id/20161025.182150.230901487.horiguchi.kyotaro@lab.ntt.co.jp

| t0 - SELECT sum(a) FROM <local single table>;
| pl - SELECT sum(a) FROM <4 local children>;
| pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
| pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;
...
| async
| t0: 3885.84 ( 40.20) 0.86% faster (should be within error, but stable on my env..)
| pl: 1617.20 ( 3.51) 1.26% faster (ditto)
| pf0: 6680.95 (478.72) 19.5% faster
| pf1: 1886.87 ( 36.25) 77.1% faster

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-robert-s-2nd-framework.patch (text/x-patch; charset=us-ascii)
From 5ef5ae125e758f221dcacbb1391ba3a517ec0a9f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 1/7] robert's 2nd framework

---
 contrib/postgres_fdw/postgres_fdw.c     |  49 ++++
 src/backend/executor/Makefile           |   4 +-
 src/backend/executor/README             |  43 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 462 ++++++++++++++++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 162 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 ++++
 src/backend/nodes/copyfuncs.c           |   1 +
 src/backend/nodes/outfuncs.c            |   1 +
 src/backend/nodes/readfuncs.c           |   1 +
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/include/executor/execAsync.h        |  29 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  15 ++
 src/include/nodes/execnodes.h           |  57 +++-
 src/include/nodes/plannodes.h           |   1 +
 17 files changed, 909 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5d270b9..595a47e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+								  PendingAsyncRequest *areq,
+								  bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -4440,6 +4455,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = postgresIterateForeignScan(node);
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+								  bool reinit)
+{
+	elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	elog(ERROR, "postgresForeignAsyncNotify");
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 2a2b7eb..dd05d1e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+	   execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execReplication.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest.  Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index d380207..e154c59 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -468,11 +468,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		i = estate->es_num_pending_async;
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (i >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[i] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[i] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[i];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async++;
+
+	/* Initialize the new request. */
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Check for events, but don't block if there are notifications
+		 * that have not been delivered yet.
+		 */
+		if (estate->es_async_callback_pending > 0)
+			ExecAsyncEventWait(estate, 0);
+		else if (!ExecAsyncEventWait(estate, cur_timeout))
+			cur_timeout = 0;			/* Timeout was reached. */
+		else
+		{
+			instr_time      cur_time;
+
+
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout < 0)
+				cur_timeout = 0;
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			/* Skip it if no callback is pending. */
+			if (!areq->callback_pending)
+				continue;
+
+			/*
+			 * Mark it as no longer needing a callback.  We must do this
+			 * before dispatching the callback in case the callback resets
+			 * the flag.
+			 */
+			areq->callback_pending = false;
+			estate->es_async_callback_pending--;
+
+			/* Perform the actual callback; set requestor_done if appropriate. */
+			if (!areq->request_complete)
+				ExecAsyncNotify(estate, areq);
+			else
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+			}
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all non-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				if (!tail->callback_pending && tail->request_complete)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; instead, we want
+		 * to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+		estate->es_wait_event_set =
+			CreateWaitEventSet(estate->es_query_cxt,
+							   estate->es_allocated_fd_events + 1);
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0)
+			ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE);
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			if (!areq->callback_pending)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+				estate->es_async_callback_pending++;
+			}
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+			}
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register.  force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->request_complete = true;
+
+	/* Make sure this request is flagged for a callback. */
+	if (!areq->callback_pending)
+	{
+		areq->callback_pending = true;
+		estate->es_async_callback_pending++;
+	}
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6986cae..e61218a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, all async requests need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * If there are any nodes that need a new asynchronous request,
+		 * make all of them.
+		 */
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			node->as_nasyncpending++;
+		}
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			for (;;)
+			{
+				if (node->as_nasyncpending == 0)
+				{
+					/*
+					 * If there is no asynchronous activity still pending
+					 * and the synchronous activity is also complete, we're
+					 * totally done scanning this node.  Otherwise, we're
+					 * done with the asynchronous stuff but must continue
+					 * scanning the synchronous children.
+					 */
+					if (node->as_syncdone)
+						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+					break;
+				}
+				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+				{
+					/* Timeout reached. */
+					break;
+				}
+				if (node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
+		 */
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->request_complete);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* Request is no longer pending. */
+	Assert(node->as_nasyncpending > 0);
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately because it might complete
+	 * synchronously and re-enter this response callback.
+	 */
+	node->as_needrequest = bms_add_member(node->as_needrequest,
+										  areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 86a77e3..61899d1 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -353,3 +353,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 30d733e..a8cabdf 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -236,6 +236,7 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1560ac3..a894a9d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -369,6 +369,7 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index dcfa6ee..67439ec 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1539,6 +1539,7 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fae1f67..968f8be 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *appendplans, int nasyncplans, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -272,6 +272,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -961,8 +962,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -997,7 +1000,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
 	}
 
 	/*
@@ -1007,7 +1017,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5009,7 +5019,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5019,6 +5029,7 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
 
 	return node;
 }
@@ -6330,3 +6341,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index f0e942a..5a61306 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 523d415..4c50f1e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
 } FdwRoutine;
 
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f9bcdd6..29f3d7c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -352,6 +352,25 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	bool		callback_pending;			/* Callback is needed. */
+	bool		request_complete;		/* Request complete, result valid. */
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -430,6 +449,31 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area   *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_async_callback_pending is the number of PendingAsyncRequests for
+	 * which callback_pending is true.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number of
+	 * events the current wait event set was allocated to handle.
+	 * es_wait_event_set, if non-NULL, is a previously allocated event set
+	 * that may be reused for a future wait, provided that nothing has been
+	 * removed and not too many more events have been added.
+	 */
+	int			es_num_pending_async;
+	int			es_max_pending_async;
+	int			es_async_callback_pending;
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1175,17 +1219,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f72f7a8..f0daada 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -228,6 +228,7 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
 } Append;
 
 /* ----------------
-- 
2.9.2

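To make the fdwapi.h contract above concrete, here is roughly the shape an
async-capable FDW takes under this interface.  This is only a sketch, not
part of the patch: the hook signatures and the ExecAsync* calls are from the
patch, but the my_* helpers and the one-socket-per-scan model are assumed.

static bool
myIsForeignPathAsyncCapable(ForeignPath *path)
{
	/* Claim async support for every path this FDW produces. */
	return true;
}

static void
myForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;

	/* Start the remote query without blocking (assumed helper). */
	my_send_query_nonblocking(node);

	/* Tell the executor we need one FD event before we can return a tuple. */
	ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}

static void
myForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
							bool reinit)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;

	/*
	 * Register the connection's socket in the executor's wait event set;
	 * passing areq as user_data lets the event loop find this request
	 * again when the socket fires.
	 */
	AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
					  my_connection_socket(node), NULL, areq);
}

static void
myForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;

	/*
	 * The socket is readable; consume the data and hand back a tuple, or
	 * NULL when the scan is exhausted (assumed helper).
	 */
	ExecAsyncRequestDone(estate, areq, (Node *) my_receive_tuple(node));
}

The key design point is that the executor, not the FDW, owns the
WaitEventSet, so any number of foreign scans can be multiplexed in a single
wait.
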
0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From 4675717734d12d404b1d66a734866b3f85830244 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 2/7] Fix some bugs.

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
 contrib/postgres_fdw/postgres_fdw.c            |   3 +-
 src/backend/executor/execAsync.c               |   4 +-
 src/backend/postmaster/pgstat.c                |   3 +
 src/include/pgstat.h                           |   3 +-
 5 files changed, 81 insertions(+), 74 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 3a09280..d7420e0 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6254,12 +6254,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- a        | aaa
- a        | aaaa
- a        | aaaaa
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | aaaa
+ a        | aaaaa
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6282,12 +6282,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6310,12 +6310,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | new
  b        | new
  b        | new
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6338,12 +6338,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | newtoo
- a        | newtoo
- a        | newtoo
  b        | newtoo
  b        | newtoo
  b        | newtoo
+ a        | newtoo
+ a        | newtoo
+ a        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6402,120 +6402,120 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                         QUERY PLAN                                          
----------------------------------------------------------------------------------------------
+                                               QUERY PLAN                                                
+---------------------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar.f1 = foo2.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6543,26 +6543,26 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
-         Hash Cond: (foo.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+         Hash Cond: (foo2.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
-         Merge Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+         Merge Cond: (bar2.f1 = foo2.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6570,19 +6570,19 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo.f1)), foo.f1
-               Sort Key: foo.f1
+               Output: (ROW(foo2.f1)), foo2.f1
+               Sort Key: foo2.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6749,8 +6749,8 @@ update bar set f2 = f2 + 100 returning *;
 update bar set f2 = f2 + 100 returning *;
  f1 | f2  
 ----+-----
-  1 | 311
   2 | 322
+  1 | 311
   6 | 266
   3 | 333
   4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 595a47e..f180838 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -4472,7 +4473,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
-	slot = postgresIterateForeignScan(node);
+	slot = ExecForeignScan(node);
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
 #include "executor/nodeAppend.h"
 #include "executor/nodeForeignscan.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
-								 occurred_event, EVENT_BUFFER_SIZE);
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
 	if (noccurred == 0)
 		return false;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..af59f51 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3398,6 +3398,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..7769d3c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -787,7 +787,8 @@ typedef enum
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

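One point worth calling out in 0002: WaitEventSetWait() recently grew a
wait_event_info argument as part of the wait-event monitoring work, which is
why the call in execAsync.c had to change.  With the new enum value, the
async wait boils down to this (sketch only; it just mirrors the call in the
patch):

	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
	int			noccurred;

	/*
	 * Block for up to "timeout" ms (-1 = forever, 0 = poll).  While
	 * blocked, the backend shows wait_event_type = IPC and
	 * wait_event = AsyncExecWait in pg_stat_activity.
	 */
	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
								 occurred_event, EVENT_BUFFER_SIZE,
								 WAIT_EVENT_ASYNC_WAIT);

(The large expected-output diffs in 0002 and 0003 are just row-ordering
changes: async-capable children are moved to the front of the Append's
subplan list, so their rows come back first.)
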
0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From 60a9ba9e74666dba290f6bf27225384966d272a9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 3/7] Modify async execution infrastructure.

---
 contrib/postgres_fdw/expected/postgres_fdw.out |  68 ++++++++--------
 contrib/postgres_fdw/postgres_fdw.c            |   5 +-
 src/backend/executor/execAsync.c               | 105 ++++++++++++++-----------
 src/backend/executor/nodeAppend.c              |  50 ++++++------
 src/backend/executor/nodeForeignscan.c         |   4 +-
 src/backend/nodes/copyfuncs.c                  |   1 +
 src/backend/nodes/outfuncs.c                   |   1 +
 src/backend/nodes/readfuncs.c                  |   1 +
 src/backend/optimizer/plan/createplan.c        |  24 +++++-
 src/backend/utils/adt/ruleutils.c              |   6 +-
 src/include/executor/nodeForeignscan.h         |   2 +-
 src/include/foreign/fdwapi.h                   |   2 +-
 src/include/nodes/execnodes.h                  |  10 ++-
 src/include/nodes/plannodes.h                  |   1 +
 14 files changed, 167 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d7420e0..fd8b628 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6402,13 +6402,13 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6416,10 +6416,10 @@ select * from bar where f1 in (select f1 from foo) for update;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6439,13 +6439,13 @@ select * from bar where f1 in (select f1 from foo) for update;
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6453,10 +6453,10 @@ select * from bar where f1 in (select f1 from foo) for share;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6477,22 +6477,22 @@ select * from bar where f1 in (select f1 from foo) for share;
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                               QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
+                                         QUERY PLAN                                          
+---------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar.f1 = foo2.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6500,16 +6500,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar2.f1 = foo.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6543,8 +6543,8 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
-         Hash Cond: (foo2.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+         Hash Cond: (foo.f1 = bar.f1)
          ->  Append
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
@@ -6561,8 +6561,8 @@ where bar.f1 = ss.f1;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
-         Merge Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+         Merge Cond: (bar2.f1 = foo.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6570,8 +6570,8 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo2.f1)), foo2.f1
-               Sort Key: foo2.f1
+               Output: (ROW(foo.f1)), foo.f1
+               Sort Key: foo.f1
                ->  Append
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index f180838..abb256b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
 								  PendingAsyncRequest *areq,
 								  bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
@@ -4477,11 +4477,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
-static void
+static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 								  bool reinit)
 {
 	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	return false;
 }
 
 static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 	bool reinit);
 static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
 static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 				 PlanState *requestee)
 {
 	PendingAsyncRequest *areq = NULL;
-	int		i = estate->es_num_pending_async;
+	int		nasync = estate->es_num_pending_async;
 
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * We start with 16 slots, and thereafter double the array size each
 	 * time we run out of slots.
 	 */
-	if (i >= estate->es_max_pending_async)
+	if (nasync >= estate->es_max_pending_async)
 	{
 		int	newmax;
 
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
 	 * one.
 	 */
-	if (estate->es_pending_async[i] == NULL)
+	if (estate->es_pending_async[nasync] == NULL)
 	{
 		areq = MemoryContextAllocZero(estate->es_query_cxt,
 									  sizeof(PendingAsyncRequest));
-		estate->es_pending_async[i] = areq;
+		estate->es_pending_async[nasync] = areq;
 	}
 	else
 	{
-		areq = estate->es_pending_async[i];
+		areq = estate->es_pending_async[nasync];
 		MemSet(areq, 0, sizeof(PendingAsyncRequest));
 	}
-	areq->myindex = estate->es_num_pending_async++;
+	areq->myindex = estate->es_num_pending_async;
 
 	/* Initialize the new request. */
 	areq->requestor = requestor;
 	areq->request_index = request_index;
 	areq->requestee = requestee;
 
-	/* Give the requestee a chance to do whatever it wants. */
+	/*
+	 * Give the requestee a chance to do whatever it wants.  A request
+	 * function may complete the request immediately if a result is ready.
+	 */
 	switch (nodeTag(requestee))
 	{
 		case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 			elog(ERROR, "unrecognized node type: %d",
 				(int) nodeTag(requestee));
 	}
+
+	/*
+	 * If a result is available, complete it immediately.
+	 */
+	if (areq->state == ASYNC_COMPLETE)
+	{
+		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+		ExecAsyncResponse(estate, areq);
+
+		return;
+	}
+
+	/* No result available now, make this node pending */
+	estate->es_num_pending_async++;
 }
 
 /*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
-			/* Skip it if no callback is pending. */
-			if (!areq->callback_pending)
-				continue;
-
-			/*
-			 * Mark it as no longer needing a callback.  We must do this
-			 * before dispatching the callback in case the callback resets
-			 * the flag.
-			 */
-			areq->callback_pending = false;
-			estate->es_async_callback_pending--;
-
-			/* Perform the actual callback; set request_done if appropraite. */
-			if (!areq->request_complete)
+			/* Process the callback if one is pending. */
+			if (areq->state == ASYNC_CALLBACK_PENDING)
+			{
+				/*
+				 * Mark it as no longer needing a callback.  We must do this
+				 * before dispatching the callback in case the callback resets
+				 * the flag.
+				 */
+				estate->es_async_callback_pending--;
 				ExecAsyncNotify(estate, areq);
-			else
+			}
+
+			if (areq->state == ASYNC_COMPLETE)
 			{
 				any_node_done = true;
 				if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				PendingAsyncRequest *head;
 				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
 
-				if (!tail->callback_pending && tail->request_complete)
+				if (tail->state == ASYNC_COMPLETE)
 					continue;
 				head = estate->es_pending_async[hidx];
 				estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
  * means wait forever, 0 means don't wait at all, and >0 means wait for the
  * indicated number of milliseconds.
  *
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was no event to wait for (as when a request completes during setup).
  */
 static bool
 ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
 	int		n;
 	bool	reinit = false;
 	bool	process_latch_set = false;
+	bool	added = false;
 
 	if (estate->es_wait_event_set == NULL)
 	{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		PendingAsyncRequest *areq = estate->es_pending_async[i];
 
 		if (areq->num_fd_events > 0)
-			ExecAsyncConfigureWait(estate, areq, reinit);
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
 	}
 
+	Assert(added);
+
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
 								 occurred_event, EVENT_BUFFER_SIZE,
 								 WAIT_EVENT_ASYNC_WAIT);
+
 	if (noccurred == 0)
 		return false;
 
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		{
 			PendingAsyncRequest *areq = w->user_data;
 
-			if (!areq->callback_pending)
-			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
-				estate->es_async_callback_pending++;
-			}
+			Assert(areq->state == ASYNC_WAITING);
+
+			areq->state = ASYNC_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
 		}
 	}
 
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 			if (areq->wants_process_latch)
 			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
+				Assert(areq->state == ASYNC_WAITING);
+				areq->state = ASYNC_CALLBACK_PENDING;
 			}
 		}
 	}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
  * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
  * and the number of calls should not exceed areq->num_fd_events (as
  * prevously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the node
+ * driver's responsibility to register at least one event for each
+ * requestor.
  */
-static void
+static bool
 ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 					   bool reinit)
 {
 	switch (nodeTag(areq->requestee))
 	{
 		case T_ForeignScanState:
-			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
 			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
 	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
 	areq->num_fd_events = num_fd_events;
 	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNC_WAITING;
 
 	if (force_reset && estate->es_wait_event_set != NULL)
 	{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
 	 * need a callback to remove registered wait events.  It's not clear
 	 * that we would come out ahead, so use brute force for now.
 	 */
+	Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
 	if (areq->num_fd_events > 0 || areq->wants_process_latch)
 		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
 
 	/* Save result and mark request as complete. */
 	areq->result = result;
-	areq->request_complete = true;
-
-	/* Make sure this request is flagged for a callback. */
-	if (!areq->callback_pending)
-	{
-		areq->callback_pending = true;
-		estate->es_async_callback_pending++;
-	}
+	areq->state = ASYNC_COMPLETE;
 }
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index e61218a..568fa25 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
 		 */
 		while ((i = bms_first_member(node->as_needrequest)) >= 0)
 		{
-			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
 			node->as_nasyncpending++;
+
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			/* If this request immediately gives a result, take it. */
+			if (node->as_nasyncresult > 0)
+				return node->as_asyncresult[--node->as_nasyncresult];
 		}
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 	}
 
 	for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-			for (;;)
+			while (node->as_nasyncpending > 0)
 			{
-				if (node->as_nasyncpending == 0)
-				{
-					/*
-					 * If there is no asynchronous activity still pending
-					 * and the synchronous activity is also complete, we're
-					 * totally done scanning this node.  Otherwise, we're
-					 * done with the asynchronous stuff but must continue
-					 * scanning the synchronous children.
-					 */
-					if (node->as_syncdone)
-						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
-					break;
-				}
-				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
-				{
-					/* Timeout reached. */
-					break;
-				}
-				if (node->as_nasyncresult > 0)
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
 				{
 					/* Asynchronous subplan returned a tuple! */
 					--node->as_nasyncresult;
 					return node->as_asyncresult[node->as_nasyncresult];
 				}
+
+				/* Timeout reached.  Fall through to the sync nodes, if any. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 			}
 		}
 
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	/* We shouldn't be called until the request is complete. */
-	Assert(areq->request_complete);
+	Assert(areq->state == ASYNC_COMPLETE);
 
 	/* Our result slot shouldn't already be occupied. */
 	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 61899d1..85dad79 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -376,7 +376,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
  *		In async mode, configure for a wait
  * ----------------------------------------------------------------
  */
-void
+bool
 ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit)
 {
@@ -384,7 +384,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
 	FdwRoutine *fdwroutine = node->fdwroutine;
 
 	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
-	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a8cabdf..c62aaf2 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -237,6 +237,7 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(appendplans);
 	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a894a9d..c2e34a8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -370,6 +370,7 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(appendplans);
 	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 67439ec..9837eff 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1540,6 +1540,7 @@ _readAppend(void)
 
 	READ_NODE_FIELD(appendplans);
 	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 968f8be..a9164ab 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+						   int referent, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -966,6 +967,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	List	   *syncplans = NIL;
 	ListCell   *subpaths;
 	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * deparsing tlist entries (see set_deparse_planstate), so we must track
+	 * where the first child of best_path->subpaths lands in the reordered
+	 * subplan list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -1005,9 +1015,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		{
 			asyncplans = lappend(asyncplans, subplan);
 			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
 		}
 		else
 			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1017,7 +1031,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5019,7 +5034,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5030,6 +5045,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
 	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index f26175e..37fc817 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4242,7 +4242,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int		idx = ((Append *) ((AppendState *) ps)->ps.plan)->referent;
+
+		dpns->outer_planstate = ((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 5a61306..2d9a62b 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 
 extern void ExecAsyncForeignScanRequest(EState *estate,
 	PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit);
 extern void ExecAsyncForeignScanNotify(EState *estate,
 	PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 4c50f1e..41fc76f 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
 typedef void (*ForeignAsyncRequest_function) (EState *estate,
 											PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											PendingAsyncRequest *areq,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 29f3d7c..9b43fd6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -357,6 +357,13 @@ typedef struct ResultRelInfo
  * State for an asynchronous tuple request.
  * ----------------
  */
+typedef enum AsyncRequestState
+{
+	ASYNC_IDLE,
+	ASYNC_WAITING,
+	ASYNC_CALLBACK_PENDING,
+	ASYNC_COMPLETE
+} AsyncRequestState;
 typedef struct PendingAsyncRequest
 {
 	int			myindex;			/* Index in es_pending_async. */
@@ -365,8 +372,7 @@ typedef struct PendingAsyncRequest
 	int			request_index;	/* Scratch space for requestor. */
 	int			num_fd_events;	/* Max number of FD events requestee needs. */
 	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
-	bool		callback_pending;			/* Callback is needed. */
-	bool		request_complete;		/* Request complete, result valid. */
+	AsyncRequestState state;
 	Node	   *result;			/* Result (NULL if no more tuples). */
 } PendingAsyncRequest;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f0daada..ebbc78d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -229,6 +229,7 @@ typedef struct Append
 	Plan		plan;
 	List	   *appendplans;
 	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
-- 
2.9.2

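For anyone reviewing 0003, this is my reading of the life cycle it gives a
PendingAsyncRequest, written out as a summary (the state names are from the
new AsyncRequestState enum; the rest is paraphrase, not patch text):

/*
 * ASYNC_IDLE
 *    ExecAsyncRequest() calls the node's request function.
 *      - result ready: ExecAsyncRequestDone() -> ASYNC_COMPLETE, and the
 *        requestor's response callback runs at once; the request is never
 *        added to es_pending_async.
 *      - must wait: ExecAsyncSetRequiredEvents() -> ASYNC_WAITING.
 * ASYNC_WAITING
 *    A socket or latch event fires in ExecAsyncEventWait()
 *      -> ASYNC_CALLBACK_PENDING.
 * ASYNC_CALLBACK_PENDING
 *    ExecAsyncEventLoop() dispatches ExecAsyncNotify(); the node's notify
 *    callback calls ExecAsyncRequestDone() -> ASYNC_COMPLETE.
 * ASYNC_COMPLETE
 *    ExecAsyncResponse() hands areq->result back to the requestor, e.g.
 *    ExecAsyncAppendResponse() for Append.
 */
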
0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 28025ab53215ce0cbbfc690bf053e600afa9fb4d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 4/7] Make postgres_fdw async-capable

---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  64 ++--
 contrib/postgres_fdw/postgres_fdw.c            | 483 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 src/backend/executor/execProcnode.c            |   9 +
 src/include/foreign/fdwapi.h                   |   2 +
 7 files changed, 510 insertions(+), 133 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 7f7a744..64cc057 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocate it with
+ * initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index fd8b628..5d448d1 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6254,12 +6254,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | aaaa
  a        | aaaaa
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6282,12 +6282,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6310,12 +6310,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | new
- b        | new
- b        | new
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | new
+ b        | new
+ b        | new
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6338,12 +6338,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | newtoo
- b        | newtoo
- b        | newtoo
  a        | newtoo
  a        | newtoo
  a        | newtoo
+ b        | newtoo
+ b        | newtoo
+ b        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6431,9 +6431,9 @@ select * from bar where f1 in (select f1 from foo) for update;
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6468,9 +6468,9 @@ select * from bar where f1 in (select f1 from foo) for share;
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6733,27 +6733,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
-  2 | 322
   1 | 311
-  6 | 266
+  2 | 322
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index abb256b..a52d54a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;	/* true if the last iterate call produced a
+								 * tuple or reached EOF, not a wait */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* Last node in the waiting list.
+								 * Maintained only by the current
+								 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
 static bool postgresForeignAsyncConfigureWait(EState *estate,
-								  PendingAsyncRequest *areq,
-								  bool reinit);
+						    PendingAsyncRequest *areq,
+						    bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
 						   PendingAsyncRequest *areq);
 
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1335,12 +1369,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1396,32 +1439,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting for this node on the same connection, let the
+			 * first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Someone else is holding this connection.  Add myself to the
+			 * tail of the waiters' list, then return not-ready.  To avoid
+			 * scanning through the waiters' list, the current owner
+			 * maintains the shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1437,7 +1574,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1445,6 +1582,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1473,9 +1613,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1493,7 +1633,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1501,16 +1641,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1712,7 +1868,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1791,6 +1949,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1801,14 +1961,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1816,10 +1976,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1857,6 +2017,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1877,14 +2039,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1892,10 +2054,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1933,6 +2095,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1953,14 +2117,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1968,10 +2132,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2018,16 +2182,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2307,7 +2471,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2360,7 +2526,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2407,8 +2576,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2527,6 +2696,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2570,6 +2740,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2924,11 +3104,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2994,47 +3174,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Fetch some more rows from the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuple is remaining
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3044,27 +3273,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connspec->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3148,7 +3432,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3158,12 +3442,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3171,9 +3455,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3304,9 +3588,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3314,10 +3598,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4463,8 +4747,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
 }
 
 /*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request.  Notify the caller if the next tuple is
+ * immediately available.  ExecForeignScan does additional work to finish
+ * the returned tuple, so call it instead of postgresIterateForeignScan to
+ * acquire a tuple in the expected shape.
  */
 static void
 postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4473,22 +4759,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
 	slot = ExecForeignScan(node);
-	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
 }
 
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when this node is the connection owner; otherwise
+ * another node on this connection owns the event.
+ */
 static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
-								  bool reinit)
+						   bool reinit)
 {
-	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
 	return false;
 }
 
+/*
+ * Process a notification from the async mechanism.  ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
 static void
 postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
 {
-	elog(ERROR, "postgresForeignAsyncNotify");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
 /*
@@ -4848,7 +5171,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..b3ac615 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index e19a3ef..3ae12bc 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1575,8 +1575,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 0dd95c6..1cba31e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -115,6 +115,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -820,6 +821,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+		{
+			ForeignScanState *fsstate = (ForeignScanState *)node;
+			FdwRoutine *fdwroutine = fsstate->fdwroutine;
+			if (fdwroutine->ShutdownForeignScan)
+				fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+		}
+		break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 41fc76f..11c3434 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
 											PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
 	ForeignAsyncRequest_function ForeignAsyncRequest;
 	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
 	ForeignAsyncNotify_function ForeignAsyncNotify;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
2.9.2
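
The heart of the patch above is the single-owner protocol for a shared
connection: only one scan node may have a query in flight, and the others
queue up behind it. The handoff that postgresIterateForeignScan performs once
the owner has drained its result condenses to roughly the sketch below
(hand_over_connection is a hypothetical helper restating the patch's inline
logic, not code from the patch):

static ForeignScanState *
hand_over_connection(ForeignScanState *node)
{
	PgFdwScanState *fsstate = GetPgFdwScanState(node);
	ForeignScanState *next = fsstate->waiter;

	if (next != NULL)
	{
		PgFdwScanState *next_state = GetPgFdwScanState(next);

		fsstate->waiter = NULL;

		/* Only the owner maintains the shortcut to the list's tail. */
		next_state->last_waiter = fsstate->last_waiter;

		/* An idle node's last_waiter points back at itself. */
		fsstate->last_waiter = node;

		return next;			/* the first waiter becomes the new owner */
	}

	return node;				/* no waiters; this node keeps the connection */
}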

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch (text/x-patch; charset=us-ascii)
From 991c5a4a14a841123237cd370fc1ec4756fad352 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 5/7] Use resource owner to prevent wait event set from leaking

Wait event sets created for async execution can live across multiple
iterations, so they would leak if an error occurred during one of those
iterations. This commit uses the resource owner mechanism to prevent
such leaks.
---
 src/backend/executor/execAsync.c      | 28 ++++++++++++++--
 src/backend/storage/ipc/latch.c       | 19 ++++++++++-
 src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
 src/include/utils/resowner_private.h  |  8 +++++
 4 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/latch.h"
+#include "utils/resowner_private.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
 static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	if (estate->es_wait_event_set == NULL)
 	{
+		ResourceOwner savedOwner;
+
 		/*
 		 * Allow for a few extra events without reinitializing.  It
 		 * doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		 * of external FDs are likely to run afoul of kernel limits anyway.
 		 */
 		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
-		estate->es_wait_event_set =
-			CreateWaitEventSet(estate->es_query_cxt,
-							   estate->es_allocated_fd_events + 1);
+
+		/*
+		 * The wait event set created here should be released in case of
+		 * error.
+		 */
+		savedOwner = CurrentResourceOwner;
+		CurrentResourceOwner = TopTransactionResourceOwner;
+
+		PG_TRY();
+		{
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_allocated_fd_events + 1);
+		}
+		PG_CATCH();
+		{
+			CurrentResourceOwner = savedOwner;
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		CurrentResourceOwner = savedOwner;
 		AddWaitEventToSet(estate->es_wait_event_set,
 						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 		reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index d45a41d..3b64e83 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set;
+	ResourceOwner savedOwner = CurrentResourceOwner;
+
+	/* This function doesn't need a resource owner for its event set */
+	CurrentResourceOwner = NULL;
+	set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	CurrentResourceOwner = savedOwner;
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	char	   *data;
 	Size		sz = 0;
 
+	if (CurrentResourceOwner)
+		ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	set->resowner = CurrentResourceOwner;
+	if (CurrentResourceOwner)
+		ResourceOwnerRememberWES(set->resowner, set);
 	return set;
 }
 
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..34c7e37 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2
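
The ordering in the patch above is deliberate: the resource owner's tracking
array is enlarged before the wait event set is created, so an out-of-memory
failure cannot leave an untracked set behind. Condensed, what
CreateWaitEventSet does after this patch looks roughly like the sketch below
(allocate_and_init_set is a hypothetical stand-in for the existing allocation
code; this is a restatement, not additional code to apply):

WaitEventSet *
CreateWaitEventSet(MemoryContext context, int nevents)
{
	WaitEventSet *set;

	/* Reserve the tracking slot first; this may elog(ERROR) safely. */
	if (CurrentResourceOwner)
		ResourceOwnerEnlargeWESs(CurrentResourceOwner);

	set = allocate_and_init_set(context, nevents);	/* hypothetical helper */

	/* Registration cannot fail now, so the set can never be stranded. */
	set->resowner = CurrentResourceOwner;
	if (set->resowner)
		ResourceOwnerRememberWES(set->resowner, set);

	return set;
}

FreeWaitEventSet then does the matching ResourceOwnerForgetWES, and the
resource owner frees and warns about anything still registered at release
time.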

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From 01abb362be9f30dfe324d5d05a0717d375c3fc57 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 6/7] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by the branch-misprediction penalty
of the branches related to async execution. Apply unlikely() to them to
avoid that penalty on the existing synchronous route. Asynchronous
execution already involves a lot of additional code, so this doesn't
add significant degradation there.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 568fa25..9c07b49 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2
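
For readers unfamiliar with the hint: unlikely() is conventionally a thin
wrapper over GCC's __builtin_expect, along the lines of the sketch below
(check c.h for the authoritative definitions). The hint only influences code
layout, so a wrong hint costs speed, never correctness:

/* Typical branch-hint macro definitions (sketch; see c.h) */
#ifdef __GNUC__
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif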

0007-Add-instrumentation-to-async-execution.patch (text/x-patch; charset=us-ascii)
From 7939f913ee610ece749fa4c5acacb0301308f503 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 7/7] Add instrumentation to async execution

Make EXPLAIN ANALYZE give sane results when async execution has taken
place.
---
 src/backend/executor/execAsync.c  | 19 +++++++++++++++++++
 src/backend/executor/instrument.c |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	PendingAsyncRequest *areq = NULL;
 	int		nasync = estate->es_num_pending_async;
 
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
 	 * available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	if (areq->state == ASYNC_COMPLETE)
 	{
 		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
 		ExecAsyncResponse(estate, areq);
+		if (areq->requestee->instrument)
+			InstrStopNode(requestee->instrument,
+						  TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
 
 		return;
 	}
 
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
 	/* No result available now, make this node pending */
 	estate->es_num_pending_async++;
 }
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
 			/* Skip it if not pending. */
 			if (areq->state == ASYNC_CALLBACK_PENDING)
 			{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				if (requestor == areq->requestor)
 					requestor_done = true;
 				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
 			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
 		}
 
 		/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
-- 
2.9.2

#17Michael Paquier
michael.paquier@gmail.com
In reply to: Kyotaro HORIGUCHI (#16)

On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.

The patches still apply; moved to CF 2017-03. Be aware of this, though:
$ git diff HEAD~6 --check
contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces.
+                           PendingAsyncRequest *areq,
contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces.
+                           bool reinit);
src/backend/utils/resowner/resowner.c:1332: new blank line at EOF.
-- 
Michael


#18Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Michael Paquier (#17)

Thank you.

At Wed, 1 Feb 2017 14:11:58 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqS0MhZrzgMVQeFEnnKABcsMnNULd8=O0PG7_h-FUp5aEQ@mail.gmail.com>

On Tue, Jan 31, 2017 at 12:45 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.

The patches still apply; moved to CF 2017-03. Be aware of this, though:
$ git diff HEAD~6 --check
contrib/postgres_fdw/postgres_fdw.c:388: indent with spaces.
+                           PendingAsyncRequest *areq,
contrib/postgres_fdw/postgres_fdw.c:389: indent with spaces.
+                           bool reinit);
src/backend/utils/resowner/resowner.c:1332: new blank line at EOF.

Thank you for letting me know about that command. I changed my check
scripts to use it, and it seems to work fine on both commit and
rebase.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#19Antonin Houska
ah@cybertec.at
In reply to: Kyotaro HORIGUCHI (#16)

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.

I was lucky enough to see an infinite loop when using this patch, which I
fixed by this change:

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 588ba18..9b87fbd
*** a/src/backend/executor/execAsync.c
--- b/src/backend/executor/execAsync.c
*************** ExecAsyncEventWait(EState *estate, long
*** 364,369 ****
--- 364,370 ----

if ((w->events & WL_LATCH_SET) != 0)
{
+ ResetLatch(MyLatch);
process_latch_set = true;
continue;
}
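
(For reference, the usual latch idiom, per the comments in
src/backend/storage/ipc/latch.h, resets the latch and then re-checks the
wakeup condition, so that a SetLatch() arriving in between is not lost.
A minimal sketch, with the hypothetical work_available() standing in for
whatever condition the caller is actually waiting on:

	for (;;)
	{
		int		rc = WaitLatch(MyLatch, WL_LATCH_SET, -1);

		if (rc & WL_LATCH_SET)
			ResetLatch(MyLatch);

		/* re-check after resetting; a later SetLatch() wakes us again */
		if (work_available())
			break;

		CHECK_FOR_INTERRUPTS();
	}

Without the ResetLatch(), the latch stays set and every subsequent wait
returns immediately, which is exactly the kind of busy loop fixed above.)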

Actually, it's only _almost_ fixed, because at some point one of the following

Assert(areq->state == ASYNC_WAITING);

statements fired. I think it was the one immediately following, but I can
imagine the same happening in the branch

if (process_latch_set)
...

I think the wants_process_latch field of PendingAsyncRequest is not useful on
its own, because the process latch can be set for reasons completely unrelated
to the asynchronous processing. If an asynchronous node is to use the latch to
signal its readiness, I think an additional flag is needed in the request
which tells ExecAsyncEventWait that the latch was set by the asynchronous
node. (A sketch of what I mean follows.)
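
A minimal sketch of that idea; the field name latch_set_by_node is
hypothetical, invented here for illustration:

	typedef struct PendingAsyncRequest
	{
		/* ... existing fields ... */
		bool	wants_process_latch;	/* node waits on the process latch */
		bool	latch_set_by_node;		/* hypothetical: set by the async
										 * node just before SetLatch(MyLatch),
										 * so its wakeups can be told apart
										 * from unrelated latch sets */
	} PendingAsyncRequest;

Then, in the process_latch_set branch of ExecAsyncEventWait, only requests
whose flag is set would be promoted:

	for (i = 0; i < estate->es_num_pending_async; ++i)
	{
		PendingAsyncRequest *areq = estate->es_pending_async[i];

		if (areq->wants_process_latch && areq->latch_set_by_node)
		{
			areq->latch_set_by_node = false;	/* consume the signal */
			areq->state = ASYNC_CALLBACK_PENDING;
		}
		/* requests not signalled by their node stay in ASYNC_WAITING */
	}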

BTW, do we really need the ASYNC_CALLBACK_PENDING state? I can imagine the
async node either to change ASYNC_WAITING directly to ASYNC_COMPLETE, or leave
it ASYNC_WAITING if the data is not ready.

In addition, the following comments are based only on code review, I didn't
verify my understanding experimentally:

* Isn't it possible for AppendState.as_asyncresult to contain multiple
responses from the same async node? Since the array stores TupleTableSlot
pointers rather than the actual tuples (so multiple items of as_asyncresult
can point to the same slot), I suspect the slot contents might no longer be
defined by the time the Append node eventually tries to return them to the
upper plan. (See the sketch after this list.)

* For the WaitEvent subsystem to work, I think postgres_fdw should keep a
separate libpq connection per node, not per user mapping. Currently the
connections are cached by user mapping, but it's legal to locate multiple
child postgres_fdw nodes of Append plan on the same remote server. I expect
that these "co-located" nodes would currently use the same user mapping and
therefore the same connection.
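
As for the first point above, a minimal sketch of one way to decouple the
queued results from the reusable scan slot; the per-subplan slot array
as_asyncresultslots is hypothetical, invented here for illustration, and the
real fix might just as well copy the tuple itself:

	/* in ExecAsyncAppendResponse, instead of queueing areq->result: */
	TupleTableSlot *src = (TupleTableSlot *) areq->result;
	TupleTableSlot *dst = node->as_asyncresultslots[areq->request_index];

	ExecCopySlot(dst, src);		/* physically copies the tuple into dst */
	node->as_asyncresult[node->as_nasyncresult++] = dst;

With one private slot per async subplan, a later refill of the scan slot can
no longer clobber a result that is still queued in as_asyncresult.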

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


#20Corey Huinker
corey.huinker@gmail.com
In reply to: Antonin Houska (#19)

On Fri, Feb 3, 2017 at 5:04 AM, Antonin Houska <ah@cybertec.at> wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.

I was lucky enough to see an infinite loop when using this patch, which I
fixed by this change:

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 588ba18..9b87fbd
*** a/src/backend/executor/execAsync.c
--- b/src/backend/executor/execAsync.c
*************** ExecAsyncEventWait(EState *estate, long
*** 364,369 ****
--- 364,370 ----

if ((w->events & WL_LATCH_SET) != 0)
{
+ ResetLatch(MyLatch);
process_latch_set = true;
continue;
}

Hi, I've been testing this patch because it seemed like it would help a use
case of mine, but I can't tell whether it currently works for cases other than
a local parent table that has many child partitions which happen to be
foreign tables. Does it? I was hoping to use it for a case like:

select x, sum(y) from one_remote_table
union all
select x, sum(y) from another_remote_table
union all
select x, sum(y) from a_third_remote_table

but while aggregates do appear to be pushed down, it seems that the remote
tables are being queried in sequence. Am I doing something wrong?

#21Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#16)

Horiguchi-san,

On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:

I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.

With the latest set of patches, I observe a crash due to an Assert failure:

#0 0x0000003969632625 in *__GI_raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003969633e05 in *__GI_abort () at abort.c:92
#2 0x000000000098b22c in ExceptionalCondition (conditionName=0xb30e02
"!(added)", errorType=0xb30d77 "FailedAssertion", fileName=0xb30d50
"execAsync.c",
lineNumber=345) at assert.c:54
#3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345
#4 0x0000000000687ed5 in ExecAsyncEventLoop (estate=0x13c01b8,
requestor=0x13c1640, timeout=-1) at execAsync.c:186
#5 0x00000000006a5170 in ExecAppend (node=0x13c1640) at nodeAppend.c:257
#6 0x0000000000692b9b in ExecProcNode (node=0x13c1640) at execProcnode.c:411
#7 0x00000000006bf4d7 in ExecResult (node=0x13c1170) at nodeResult.c:113
#8 0x0000000000692b5c in ExecProcNode (node=0x13c1170) at execProcnode.c:399
#9 0x00000000006a596b in fetch_input_tuple (aggstate=0x13c06a0) at
nodeAgg.c:587
#10 0x00000000006a8530 in agg_fill_hash_table (aggstate=0x13c06a0) at
nodeAgg.c:2272
#11 0x00000000006a7e76 in ExecAgg (node=0x13c06a0) at nodeAgg.c:1910
#12 0x0000000000692d69 in ExecProcNode (node=0x13c06a0) at execProcnode.c:514
#13 0x00000000006c1a42 in ExecSort (node=0x13c03d0) at nodeSort.c:103
#14 0x0000000000692d3f in ExecProcNode (node=0x13c03d0) at execProcnode.c:506
#15 0x000000000068e733 in ExecutePlan (estate=0x13c01b8,
planstate=0x13c03d0, use_parallel_mode=0 '\000', operation=CMD_SELECT,
sendTuples=1 '\001',
numberTuples=0, direction=ForwardScanDirection, dest=0x7fa368ee1da8)
at execMain.c:1609
#16 0x000000000068c751 in standard_ExecutorRun (queryDesc=0x135c568,
direction=ForwardScanDirection, count=0) at execMain.c:341
#17 0x000000000068c5dc in ExecutorRun (queryDesc=0x135c568,
<snip>

I was running a query whose plan looked like:

explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
QUERY PLAN
------------------------------------------------------
Sort
Sort Key: ((ptab.tableoid)::regclass)
-> HashAggregate
Group Key: (ptab.tableoid)::regclass, ptab.a
-> Result
-> Append
-> Foreign Scan on ptab_00001
-> Foreign Scan on ptab_00002
-> Foreign Scan on ptab_00003
-> Foreign Scan on ptab_00004
-> Foreign Scan on ptab_00005
-> Foreign Scan on ptab_00006
-> Foreign Scan on ptab_00007
-> Foreign Scan on ptab_00008
-> Foreign Scan on ptab_00009
-> Foreign Scan on ptab_00010
<snip>

The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).

There is a crash in one more case, which seems related to how WaitEventSet
objects are manipulated during resource-owner-mediated cleanup of a failed
query, such as after the FDW returned an error like below:

ERROR: relation "public.ptab_00010" does not exist
CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010

The backtrace in this looks like below:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
301 lastidx = resarr->lastidx;
(gdb)
(gdb) bt
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
#1 0x00000000009c6578 in ResourceOwnerForgetWES
(owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3 0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
at resowner.c:566
#4 0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
'\001') at resowner.c:485
#5 0x0000000000524172 in AbortTransaction () at xact.c:2588
#6 0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
#7 0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8 0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9 0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
#11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
postmaster.c:1330
#12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228

There is a segfault when accessing the events variable, whose members seem
to have been pfree'd already (the 0x7f7f... byte pattern is what
CLOBBER_FREED_MEMORY writes over freed memory):

(gdb) f 2
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
600 ResourceOwnerForgetWES(set->resowner, set);
(gdb) p *set
$5 = {
nevents = 2139062143,
nevents_space = 2139062143,
resowner = 0x7f7f7f7f7f7f7f7f,
events = 0x7f7f7f7f7f7f7f7f,
latch = 0x7f7f7f7f7f7f7f7f,
latch_pos = 2139062143,
epoll_fd = 2139062143,
epoll_ret_events = 0x7f7f7f7f7f7f7f7f
}

Thanks,
Amit


#22Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Amit Langote (#21)

Thank you very much for testing this!

At Tue, 7 Feb 2017 13:28:42 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <9058d70b-a6b0-8b3c-091a-fe77ed0df580@lab.ntt.co.jp>

Horiguchi-san,

On 2017/01/31 12:45, Kyotaro HORIGUCHI wrote:

I noticed that this patch is conflicting with 665d1fa (Logical
replication) so I rebased this. Only executor/Makefile
conflicted.

With the latest set of patches, I observe a crash due to an Assert failure:

#3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345

This means that no pending fdw scan put itself into the waiting state,
which stalls the whole thing. That happens if nothing is actually
waiting for a result. I suppose that all of the foreign scans ran on
the same connection. In any case it must be a mistake in the state
transitions. I'll look into it.

I was running a query whose plan looked like:

explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
QUERY PLAN
------------------------------------------------------
Sort
Sort Key: ((ptab.tableoid)::regclass)
-> HashAggregate
Group Key: (ptab.tableoid)::regclass, ptab.a
-> Result
-> Append
-> Foreign Scan on ptab_00001
-> Foreign Scan on ptab_00002
-> Foreign Scan on ptab_00003
-> Foreign Scan on ptab_00004
-> Foreign Scan on ptab_00005
-> Foreign Scan on ptab_00006
-> Foreign Scan on ptab_00007
-> Foreign Scan on ptab_00008
-> Foreign Scan on ptab_00009
-> Foreign Scan on ptab_00010
<snip>

The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).

Yeah, it seems to me to be unrelated to how many there are.

There is a crash in one more case, which seems related to how WaitEventSet
objects are manipulated during resource-owner-mediated cleanup of a failed
query, such as after the FDW returned an error like below:

ERROR: relation "public.ptab_00010" does not exist
CONTEXT: Remote SQL command: SELECT a, b FROM public.ptab_00010

The backtrace in this looks like below:

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
301 lastidx = resarr->lastidx;
(gdb)
(gdb) bt
#0 0x00000000009c4c35 in ResourceArrayRemove (resarr=0x7f7f7f7f7f7f80bf,
value=20645152) at resowner.c:301
#1 0x00000000009c6578 in ResourceOwnerForgetWES
(owner=0x7f7f7f7f7f7f7f7f, events=0x13b0520) at resowner.c:1317
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
#3 0x00000000009c5338 in ResourceOwnerReleaseInternal (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1 '\001')
at resowner.c:566
#4 0x00000000009c5155 in ResourceOwnerRelease (owner=0x12de768,
phase=RESOURCE_RELEASE_BEFORE_LOCKS, isCommit=0 '\000', isTopLevel=1
'\001') at resowner.c:485
#5 0x0000000000524172 in AbortTransaction () at xact.c:2588
#6 0x0000000000524854 in AbortCurrentTransaction () at xact.c:3016
#7 0x0000000000836aa6 in PostgresMain (argc=1, argv=0x12d7b08,
dbname=0x12d7968 "postgres", username=0x12d7948 "amit") at postgres.c:3860
#8 0x00000000007a49d8 in BackendRun (port=0x12cdf00) at postmaster.c:4310
#9 0x00000000007a4151 in BackendStartup (port=0x12cdf00) at postmaster.c:3982
#10 0x00000000007a0885 in ServerLoop () at postmaster.c:1722
#11 0x000000000079febf in PostmasterMain (argc=3, argv=0x12aacc0) at
postmaster.c:1330
#12 0x00000000006e7549 in main (argc=3, argv=0x12aacc0) at main.c:228

There is a segfault when accessing the events variable, whose members seem
to have been pfree'd already:

(gdb) f 2
#2 0x0000000000806098 in FreeWaitEventSet (set=0x13b0520) at latch.c:600
600 ResourceOwnerForgetWES(set->resowner, set);
(gdb) p *set
$5 = {
nevents = 2139062143,
nevents_space = 2139062143,
resowner = 0x7f7f7f7f7f7f7f7f,
events = 0x7f7f7f7f7f7f7f7f,
latch = 0x7f7f7f7f7f7f7f7f,
latch_pos = 2139062143,
epoll_fd = 2139062143,
epoll_ret_events = 0x7f7f7f7f7f7f7f7f
}

Mmm, I reproduced it quite easily. A silly bug.

Something bad is happening in the interaction between freeing the
ExecutorState memory context and the resource owner. Perhaps the
ExecutorState is freed by the resource owner (as part of its ancestors)
before the memory for the WaitEventSet is freed. That was careless of me.
I'll reconsider it.

Great thanks for the report.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#23Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#22)
13 attachment(s)

At Thu, 16 Feb 2017 21:06:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170216.210600.214980879.horiguchi.kyotaro@lab.ntt.co.jp>

#3 0x00000000006883ed in ExecAsyncEventWait (estate=0x13c01b8,
timeout=-1) at execAsync.c:345

This means that no pending fdw scan put itself into the waiting state,
which stalls the whole thing. That happens if nothing is actually
waiting for a result. I suppose that all of the foreign scans ran on
the same connection. In any case it must be a mistake in the state
transitions. I'll look into it.

...

I was running a query whose plan looked like:

explain (costs off) select tableoid::regclass, a, min(b), max(b) from ptab
group by 1,2 order by 1;
QUERY PLAN
------------------------------------------------------
Sort
Sort Key: ((ptab.tableoid)::regclass)
-> HashAggregate
Group Key: (ptab.tableoid)::regclass, ptab.a
-> Result
-> Append
-> Foreign Scan on ptab_00001
-> Foreign Scan on ptab_00002
-> Foreign Scan on ptab_00003
-> Foreign Scan on ptab_00004
-> Foreign Scan on ptab_00005
-> Foreign Scan on ptab_00006
-> Foreign Scan on ptab_00007
-> Foreign Scan on ptab_00008
-> Foreign Scan on ptab_00009
-> Foreign Scan on ptab_00010
<snip>

The snipped part contains Foreign Scans on 90 more foreign partitions (in
fact, I could see the crash even with 10 foreign table partitions for the
same query).

Yeah, it seems to me to be unrelated to how many there are.

Finally, I couldn't reproduce the crash for the (maybe) same case. I
can guess at two causes. One is a situation where
node->as_nasyncpending differs from estate->es_num_pending_async,
but I couldn't find a way for that to happen. The other is a situation
in postgresIterateForeignScan where the "next owner" has reached EOF
but another waiter has not. I haven't reproduced that situation, but
I fixed the code for that case anyway. In addition, I found a bug in
ExecAsyncAppendResponse: it calls bms_add_member in an inappropriate
way.
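
For reference, the point of the bms fix: bms_add_member() may repalloc the
set and return a different pointer, so the result must always be assigned
back. A minimal standalone illustration, not taken from the patch:

	Bitmapset  *set = NULL;

	set = bms_add_member(set, 42);	/* correct: keep the returned pointer */
	bms_add_member(set, 99);		/* wrong: the enlarged set may have moved */

The wrong form appears to work as long as the set needs no enlargement,
which is why the bug only surfaces beyond 32 members, as the commit
message below notes.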

Mmm, I reproduced it quite easily. A silly bug.

Something bad is happening in the interaction between freeing the
ExecutorState memory context and the resource owner. Perhaps the
ExecutorState is freed by the resource owner (as part of its ancestors)
before the memory for the WaitEventSet is freed. That was careless of me.
I'll reconsider it.

The cause was that the WaitEventSet was allocated in the ExecutorState
memory context but registered to TopTransactionResourceOwner. I fixed it.
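
A minimal sketch of the corrected call, as it appears in attached patch
0009: the set is now allocated in a memory context whose lifetime matches
the resource owner it is registered to:

	estate->es_wait_event_set =
		CreateWaitEventSet(TopTransactionContext,
						   TopTransactionResourceOwner,
						   estate->es_allocated_fd_events + 1);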

These fixes are made on top of the previous patches for now. In the
attached files, 0008 and 0009 are for the second bug, 0012 is for the
first bug, and 0013 is for the bms bug.

Sorry for the confusing patch set; I will resend neater ones soon.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0013-Fix-a-bug-of-a-usage-of-bms_add_member.patch (text/x-patch)
From 995f2133c9cb651de46d8c9506537f72e0546b82 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 17:30:12 +0900
Subject: [PATCH 13/13] Fix a bug of a usage of bms_add_member.

bms_add_member may repalloc the set, changing its location, so reassigning
the result is mandatory. Forgetting this causes a bug once the set has more
than 32 members.
---
 src/backend/executor/nodeAppend.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 9293139..109435d 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -428,5 +428,6 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	 * Mark the node that returned a result as ready for a new request.  We
 	 * don't launch another one here immediately because it might complete
 	 */
-	bms_add_member(node->as_needrequest, areq->request_index);
+	node->as_needrequest =
+		bms_add_member(node->as_needrequest, areq->request_index);
 }
-- 
2.9.2

0012-Fix-a-possible-bug.patch (text/x-patch)
From fab48a75f2bdb89ba96068938da759b6e67682c2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 17:28:49 +0900
Subject: [PATCH 12/13] Fix a possible bug.

I haven't actually observed it, but calling postgresIterateForeignScan on a
node that has reached EOF can cause a crash. This may fix it.
---
 contrib/postgres_fdw/postgres_fdw.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 04f520b..6b694d0 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -1449,6 +1449,16 @@ postgresIterateForeignScan(ForeignScanState *node)
 	{
 		ForeignScanState *next_conn_owner = node;
 
+		/*
+		 * This can be called for a node that has already reached EOF; in that
+		 * case, do nothing other than return a null tuple.
+		 */
+		if (GetPgFdwScanState(node)->eof_reached)
+		{
+			fsstate->result_ready = true;
+			return ExecClearTuple(slot);
+		}
+
 		/* This node has sent a query on this connection */
 		if (fsstate->s.connpriv->current_owner == node)
 		{
-- 
2.9.2

0011-Some-non-functional-fixes.patch (text/x-patch)
From 98e271051e93a1a10d0b4f45939f18e6cbe01367 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 17:23:42 +0900
Subject: [PATCH 11/13] Some non-functional fixes.

Rename the items of AsyncRequestState, rewrite some comments, and rename
some struct members for readability.
---
 contrib/postgres_fdw/postgres_fdw.c | 66 +++++++++++++++++++------------------
 src/backend/executor/execAsync.c    | 62 ++++++++++++++++------------------
 src/backend/executor/nodeAppend.c   |  4 +--
 src/include/nodes/execnodes.h       |  8 ++---
 4 files changed, 69 insertions(+), 71 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index a52d54a..04f520b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -129,17 +129,17 @@ enum FdwDirectModifyPrivateIndex
 /*
  * Connection private area structure.
  */
- typedef struct PgFdwConnspecate
+typedef struct PgFdwConnpriv
 {
 	ForeignScanState *current_owner;	/* The node currently running a query
 										 * on this connection*/
-} PgFdwConnspecate;
+} PgFdwConnpriv;
 
 /* Execution state base type */
 typedef struct PgFdwState
 {
 	PGconn	   *conn;			/* connection for the scan */
-	PgFdwConnspecate *connspec;	/* connection private memory */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
 } PgFdwState;
 
 /*
@@ -385,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
 static bool postgresForeignAsyncConfigureWait(EState *estate,
-						    PendingAsyncRequest *areq,
-						    bool reinit);
+							PendingAsyncRequest *areq,
+							bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
 						   PendingAsyncRequest *areq);
 
@@ -1370,9 +1370,9 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * establish new connection if necessary.
 	 */
 	fsstate->s.conn = GetConnection(user, false);
-	fsstate->s.connspec = (PgFdwConnspecate *)
-		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
-	fsstate->s.connspec->current_owner = NULL;
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
 	fsstate->waiter = NULL;
 	fsstate->last_waiter = node;
 
@@ -1450,7 +1450,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 		ForeignScanState *next_conn_owner = node;
 
 		/* This node has sent a query on this connection */
-		if (fsstate->s.connspec->current_owner == node)
+		if (fsstate->s.connpriv->current_owner == node)
 		{
 			/* Check if the result is available */
 			if (PQisBusy(fsstate->s.conn))
@@ -1498,7 +1498,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 				fsstate->last_waiter = node;
 			}
 		}
-		else if (fsstate->s.connspec->current_owner)
+		else if (fsstate->s.connpriv->current_owner)
 		{
 			/*
 			 * Anyone else is holding this connection. Add myself to the tail
@@ -1507,7 +1507,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 			 * shortcut to the last waiter.
 			 */
 			PgFdwScanState *conn_owner_state =
-				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
 			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
 			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
 
@@ -1523,11 +1523,13 @@ postgresIterateForeignScan(ForeignScanState *node)
 			return ExecClearTuple(slot);
 		}
 
+		/* At this time no node is running on the connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
 		/*
 		 * Send the next request for the next owner of this connection if
 		 * needed.
 		 */
-
 		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
 		{
 			PgFdwScanState *next_owner_state =
@@ -1869,8 +1871,8 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 
 	/* Open connection; report that we'll create a prepared statement. */
 	fmstate->s.conn = GetConnection(user, true);
-	fmstate->s.connspec = (PgFdwConnspecate *)
-		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -2472,8 +2474,8 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * establish new connection if necessary.
 	 */
 	dmstate->s.conn = GetConnection(user, false);
-	dmstate->s.connspec = (PgFdwConnspecate *)
-		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2696,7 +2698,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
-		PgFdwConnspecate *connspec;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2740,13 +2742,13 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
-		connspec = GetConnectionSpecificStorage(fpinfo->user,
-												sizeof(PgFdwConnspecate));
-		if (connspec)
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
 		{
 			PgFdwState tmpstate;
 			tmpstate.conn = conn;
-			tmpstate.connspec = connspec;
+			tmpstate.connpriv = connpriv;
 			vacate_connection(&tmpstate);
 		}
 
@@ -3181,7 +3183,7 @@ request_more_data(ForeignScanState *node)
 	char		sql[64];
 
 	/* The connection should be vacant */
-	Assert(fsstate->s.connspec->current_owner == NULL);
+	Assert(fsstate->s.connpriv->current_owner == NULL);
 
 	/*
 	 * If this is the first call after Begin or ReScan, we need to create the
@@ -3196,7 +3198,7 @@ request_more_data(ForeignScanState *node)
 	if (!PQsendQuery(conn, sql))
 		pgfdw_report_error(ERROR, NULL, conn, false, sql);
 
-	fsstate->s.connspec->current_owner = node;
+	fsstate->s.connpriv->current_owner = node;
 }
 
 /*
@@ -3210,7 +3212,7 @@ fetch_received_data(ForeignScanState *node)
 	MemoryContext oldcontext;
 
 	/* I should be the current connection owner */
-	Assert(fsstate->s.connspec->current_owner == node);
+	Assert(fsstate->s.connpriv->current_owner == node);
 
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
@@ -3287,14 +3289,14 @@ fetch_received_data(ForeignScanState *node)
 	}
 	PG_CATCH();
 	{
-		fsstate->s.connspec->current_owner = NULL;
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
-	fsstate->s.connspec->current_owner = NULL;
+	fsstate->s.connpriv->current_owner = NULL;
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -3305,16 +3307,16 @@ fetch_received_data(ForeignScanState *node)
 static void
 vacate_connection(PgFdwState *fdwstate)
 {
-	PgFdwConnspecate *connspec = fdwstate->connspec;
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
 	ForeignScanState *owner;
 
-	if (connspec == NULL || connspec->current_owner == NULL)
+	if (connpriv == NULL || connpriv->current_owner == NULL)
 		return;
 
 	/*
 	 * let the current connection owner read the result for the running query
 	 */
-	owner = connspec->current_owner;
+	owner = connpriv->current_owner;
 	fetch_received_data(owner);
 
 	/* Clear the waiting list */
@@ -3335,7 +3337,7 @@ static void
 absorb_current_result(ForeignScanState *node)
 {
 	PgFdwScanState *fsstate = GetPgFdwScanState(node);
-	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
 
 	if (owner)
 	{
@@ -3344,7 +3346,7 @@ absorb_current_result(ForeignScanState *node)
 
 		while(PQisBusy(conn))
 			PQclear(PQgetResult(conn));
-		fsstate->s.connspec->current_owner = NULL;
+		fsstate->s.connpriv->current_owner = NULL;
 		fsstate->async_waiting = false;
 	}
 }
@@ -4785,7 +4787,7 @@ postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 	if (!reinit)
 		return true;
 
-	if (fsstate->s.connspec->current_owner == node)
+	if (fsstate->s.connpriv->current_owner == node)
 	{
 		AddWaitEventToSet(estate->es_wait_event_set,
 						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index a8e5f80..03ab811 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -99,14 +99,12 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	areq->myindex = estate->es_num_pending_async;
 
 	/* Initialize the new request. */
+	areq->state = ASYNCREQ_IDLE;
 	areq->requestor = requestor;
 	areq->request_index = request_index;
 	areq->requestee = requestee;
 
-	/*
-	 * Give the requestee a chance to do whatever it wants.
-	 * Requst functions return true if a result is immediately available.
-	 */
+	/* Give the requestee a chance to do whatever it wants. */
 	switch (nodeTag(requestee))
 	{
 		case T_ForeignScanState:
@@ -118,10 +116,8 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 				(int) nodeTag(requestee));
 	}
 
-	/*
-	 * If a result is available, complete it immediately.
-	 */
-	if (areq->state == ASYNC_COMPLETE)
+	/* If a result is available, complete it immediately */
+	if (areq->state == ASYNCREQ_COMPLETE)
 	{
 		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
 
@@ -178,15 +174,16 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		CHECK_FOR_INTERRUPTS();
 
 		/*
-		 * Check for events, but don't block if there notifications that
-		 * have not been delivered yet.
+		 * Check for events, but don't block if any undelivered notifications
+		 * remain; process them immediately.
 		 */
 		if (estate->es_async_callback_pending > 0)
 			ExecAsyncEventWait(estate, 0);
 		else if (!ExecAsyncEventWait(estate, cur_timeout))
 			cur_timeout = 0;			/* Timeout was reached. */
-		else
+		else if (timeout > 0)
 		{
+			/* Exited before timeout. Calculate the remaining time. */
 			instr_time      cur_time;
 			long            cur_timeout = -1;
 
@@ -205,19 +202,15 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 			if (areq->requestee->instrument)
 				InstrStartNode(areq->requestee->instrument);
 
-			/* Skip it if not pending. */
-			if (areq->state == ASYNC_CALLBACK_PENDING)
+			/* Notify if the requestee is ready */
+			if (areq->state == ASYNCREQ_CALLBACK_PENDING)
 			{
-				/*
-				 * Mark it as no longer needing a callback.  We must do this
-				 * before dispatching the callback in case the callback resets
-				 * the flag.
-				 */
 				estate->es_async_callback_pending--;
 				ExecAsyncNotify(estate, areq);
 			}
 
-			if (areq->state == ASYNC_COMPLETE)
+			/* Deliver the acquired tuple to the requester */
+			if (areq->state == ASYNCREQ_COMPLETE)
 			{
 				any_node_done = true;
 				if (requestor == areq->requestor)
@@ -248,7 +241,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				PendingAsyncRequest *head;
 				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
 
-				if (tail->state == ASYNC_COMPLETE)
+				Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+				if (tail->state == ASYNCREQ_COMPLETE)
 					continue;
 				head = estate->es_pending_async[hidx];
 				estate->es_pending_async[tidx] = head;
@@ -324,7 +319,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
 	{
 		PendingAsyncRequest *areq = estate->es_pending_async[i];
 
-		if (areq->num_fd_events > 0)
+		if (areq->num_fd_events > 0 || areq->wants_process_latch)
 			added |= ExecAsyncConfigureWait(estate, areq, reinit);
 	}
 
@@ -358,9 +353,9 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		{
 			PendingAsyncRequest *areq = w->user_data;
 
-			Assert(areq->state == ASYNC_WAITING);
+			Assert(areq->state == ASYNCREQ_WAITING);
 
-			areq->state = ASYNC_CALLBACK_PENDING;
+			areq->state = ASYNCREQ_CALLBACK_PENDING;
 			estate->es_async_callback_pending++;
 		}
 	}
@@ -377,8 +372,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 			if (areq->wants_process_latch)
 			{
-				Assert(areq->state == ASYNC_WAITING);
-				areq->state = ASYNC_CALLBACK_PENDING;
+				Assert(areq->state == ASYNCREQ_WAITING);
+				areq->state = ASYNCREQ_CALLBACK_PENDING;
 			}
 		}
 	}
@@ -453,11 +448,11 @@ ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
 /*
  * An executor node should call this function to signal that it needs to wait
  * on one or more file descriptor events that can be registered on a
- * WaitEventSet, and possibly also on the process latch.  num_fd_events
- * should be the maximum number of file descriptor events that it will wish to
- * register.  force_reset should be true if the node can't reuse the
- * WaitEventSet it most recently initialized, for example because it needs to
- * drop a wait event from the set.
+ * WaitEventSet, and possibly also on process latch.  num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
  */
 void
 ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
@@ -467,7 +462,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
 	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
 	areq->num_fd_events = num_fd_events;
 	areq->wants_process_latch = wants_process_latch;
-	areq->state = ASYNC_WAITING;
+	areq->state = ASYNCREQ_WAITING;
 
 	if (force_reset && estate->es_wait_event_set != NULL)
 	{
@@ -497,12 +492,13 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
 	 * need a callback to remove registered wait events.  It's not clear
 	 * that we would come out ahead, so use brute force for now.
 	 */
-	Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+	Assert(areq->state == ASYNCREQ_IDLE ||
+		   areq->state == ASYNCREQ_CALLBACK_PENDING);
 
 	if (areq->num_fd_events > 0 || areq->wants_process_latch)
 		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
 
 	/* Save result and mark request as complete. */
 	areq->result = result;
-	areq->state = ASYNC_COMPLETE;
+	areq->state = ASYNCREQ_COMPLETE;
 }
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 9c07b49..9293139 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -403,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	/* We shouldn't be called until the request is complete. */
-	Assert(areq->state == ASYNC_COMPLETE);
+	Assert(areq->state == ASYNCREQ_COMPLETE);
 
 	/* Our result slot shouldn't already be occupied. */
 	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
@@ -420,7 +420,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	if (TupIsNull(slot))
 		return;
 
-	/* Save result so we can return it. */
+	/* Set the next tuple from this requestee. */
 	Assert(node->as_nasyncresult < node->as_nasyncplans);
 	node->as_asyncresult[node->as_nasyncresult++] = slot;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5afcd34..7a62eff 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -363,10 +363,10 @@ typedef struct ResultRelInfo
  */
 typedef enum AsyncRequestState
 {
-	ASYNC_IDLE,
-	ASYNC_WAITING,
-	ASYNC_CALLBACK_PENDING,
-	ASYNC_COMPLETE
+	ASYNCREQ_IDLE,						/* Nothing is requested */
+	ASYNCREQ_WAITING,					/* Waiting for events */
+	ASYNCREQ_CALLBACK_PENDING,			/* Has events to be processed */
+	ASYNCREQ_COMPLETE					/* Result is available */
 } AsyncRequestState;
 typedef struct PendingAsyncRequest
 {
-- 
2.9.2

0010-Fix-a-typo-of-mcxt.c.patch (text/x-patch)
From 886e0bbf1e7742c9c34582cccb5a05575420555c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:14:15 +0900
Subject: [PATCH 10/13] Fix a typo of mcxt.c

---
 src/backend/utils/mmgr/mcxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 6ad0bb4..2e74e29 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -208,7 +208,7 @@ MemoryContextDelete(MemoryContext context)
 	MemoryContextDeleteChildren(context);
 
 	/*
-	 * It's not entirely clear whether 'tis better to do this before or after
+	 * It's not entirely clear whether it's better to do this before or after
 	 * delinking the context; but an error in a callback will likely result in
 	 * leaking the whole context (if it's not a root context) if we do it
 	 * after, so let's do it before.
-- 
2.9.2

0009-Fix-the-resource-owner-to-be-used.patch (text/x-patch)
From cdad8b09c0e66507d65b1c8db552923e89d23294 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:12:40 +0900
Subject: [PATCH 09/13] Fix the resource owner to be used

Fixup of previous commit.
---
 src/backend/executor/execAsync.c | 28 +++++++---------------------
 1 file changed, 7 insertions(+), 21 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 588ba18..a8e5f80 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,7 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/latch.h"
-#include "utils/resowner_private.h"
+#include "utils/memutils.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
 static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -297,8 +297,6 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	if (estate->es_wait_event_set == NULL)
 	{
-		ResourceOwner savedOwner;
-
 		/*
 		 * Allow for a few extra events without reinitializing.  It
 		 * doesn't seem worth the complexity of doing anything very
@@ -308,26 +306,14 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
 
 		/*
-		 * The wait event set created here should be released in case of
-		 * error.
+		 * The wait event set created here should outlive the ExecutorState
+		 * context but still be released in case of error.
 		 */
-		savedOwner = CurrentResourceOwner;
-		CurrentResourceOwner = TopTransactionResourceOwner;
-
-		PG_TRY();
-		{
-			estate->es_wait_event_set =
-				CreateWaitEventSet(estate->es_query_cxt,
-								   estate->es_allocated_fd_events + 1);
-		}
-		PG_CATCH();
-		{
-			CurrentResourceOwner = savedOwner;
-			PG_RE_THROW();
-		}
-		PG_END_TRY();
+		estate->es_wait_event_set =
+			CreateWaitEventSet(TopTransactionContext,
+							   TopTransactionResourceOwner,
+							   estate->es_allocated_fd_events + 1);
 
-		CurrentResourceOwner = savedOwner;
 		AddWaitEventToSet(estate->es_wait_event_set,
 						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 		reinit = true;
-- 
2.9.2

0008-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch)
From 7b85a878ddef06b9dda1608eed318c176d43b575 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 08/13] Allow wait event set to be registered to resource owner

A WaitEventSet may have to be released via a resource owner. This change
allows the creator of a WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 14 ++++++++------
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         |  1 -
 src/include/storage/latch.h                   |  4 +++-
 5 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7939b1f..16a5d7a 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 30dc77b..da2c41d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -331,7 +331,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 
 	/* This function doesn't need resowner for event set */
 	CurrentResourceOwner = NULL;
-	set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 	CurrentResourceOwner = savedOwner;
 
 	if (wakeEvents & WL_TIMEOUT)
@@ -490,14 +490,14 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
-	if (CurrentResourceOwner)
-		ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
 
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
@@ -558,9 +558,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
-	set->resowner = CurrentResourceOwner;
-	if (CurrentResourceOwner)
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
 		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 6f1ef0b..503aef1 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  &MyProc->procLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 34c7e37..d497216 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -1329,4 +1329,3 @@ PrintWESLeakWarning(WaitEventSet *events)
 	elog(WARNING, "wait event set leak: %p still referenced",
 		 events);
 }
-
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
-- 
2.9.2

0007-Add-instrumentation-to-async-execution.patch (text/x-patch)
From 50e0e4ba3b495b85de95b2261d248679ffeb40f2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 19:04:04 +0900
Subject: [PATCH 07/13] Add instrumentation to async execution

Make explain analyze give a sane result when async execution has taken
place.
---
 src/backend/executor/execAsync.c  | 19 +++++++++++++++++++
 src/backend/executor/instrument.c |  2 +-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 40e3f67..588ba18 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -46,6 +46,9 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	PendingAsyncRequest *areq = NULL;
 	int		nasync = estate->es_num_pending_async;
 
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
 	 * available slots in the es_pending_async array, expand the array.
@@ -121,11 +124,17 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	if (areq->state == ASYNC_COMPLETE)
 	{
 		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+
 		ExecAsyncResponse(estate, areq);
+		if (areq->requestee->instrument)
+			InstrStopNode(requestee->instrument,
+						  TupIsNull((TupleTableSlot*)areq->result) ? 0.0 : 1.0);
 
 		return;
 	}
 
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
 	/* No result available now, make this node pending */
 	estate->es_num_pending_async++;
 }
@@ -193,6 +202,9 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
 			/* Skip it if not pending. */
 			if (areq->state == ASYNC_CALLBACK_PENDING)
 			{
@@ -211,7 +223,14 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				if (requestor == areq->requestor)
 					requestor_done = true;
 				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
 			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
 		}
 
 		/* If any node completed, compact the array. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
-- 
2.9.2

0006-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch)
From a7f4a6833c6eabae9c66ed1c99f15948ef91c59f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 06/13] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by the misprediction penalty of the
branches related to async execution. Apply unlikely() to them to avoid
that penalty on the synchronous route. Asynchronous execution already
adds a lot of code, so this doesn't cause significant degradation.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 568fa25..9c07b49 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -248,7 +248,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

0005-Use-resource-owner-to-prevent-wait-event-set-from-le.patch (text/x-patch)
From 436b22a547e66875480ad8a151ba9f1fe239dd8c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:01:56 +0900
Subject: [PATCH 05/13] Use resource owner to prevent wait event set from
 leaking

Wait event sets created for async execution can live across multiple
iterations, so they leak if an error is raised during those iterations.
This commit uses a resource owner to prevent such leaks.
---
 src/backend/executor/execAsync.c      | 28 ++++++++++++++--
 src/backend/storage/ipc/latch.c       | 19 ++++++++++-
 src/backend/utils/resowner/resowner.c | 63 +++++++++++++++++++++++++++++++++++
 src/include/utils/resowner_private.h  |  8 +++++
 4 files changed, 114 insertions(+), 4 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 33496a9..40e3f67 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -20,6 +20,7 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/latch.h"
+#include "utils/resowner_private.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
 static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
@@ -277,6 +278,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	if (estate->es_wait_event_set == NULL)
 	{
+		ResourceOwner savedOwner;
+
 		/*
 		 * Allow for a few extra events without reinitializing.  It
 		 * doesn't seem worth the complexity of doing anything very
@@ -284,9 +287,28 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		 * of external FDs are likely to run afoul of kernel limits anyway.
 		 */
 		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
-		estate->es_wait_event_set =
-			CreateWaitEventSet(estate->es_query_cxt,
-							   estate->es_allocated_fd_events + 1);
+
+		/*
+		 * The wait event set created here should be released in case of
+		 * error.
+		 */
+		savedOwner = CurrentResourceOwner;
+		CurrentResourceOwner = TopTransactionResourceOwner;
+
+		PG_TRY();
+		{
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_allocated_fd_events + 1);
+		}
+		PG_CATCH();
+		{
+			CurrentResourceOwner = savedOwner;
+			PG_RE_THROW();
+		}
+		PG_END_TRY();
+
+		CurrentResourceOwner = savedOwner;
 		AddWaitEventToSet(estate->es_wait_event_set,
 						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
 		reinit = true;
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 0079ba5..30dc77b 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,7 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -324,7 +326,13 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set;
+	ResourceOwner savedOwner = CurrentResourceOwner;
+
+	/* This function doesn't need resowner for event set */
+	CurrentResourceOwner = NULL;
+	set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	CurrentResourceOwner = savedOwner;
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -488,6 +496,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	char	   *data;
 	Size		sz = 0;
 
+	if (CurrentResourceOwner)
+		ResourceOwnerEnlargeWESs(CurrentResourceOwner);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +558,9 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	set->resowner = CurrentResourceOwner;
+	if (CurrentResourceOwner)
+		ResourceOwnerRememberWES(set->resowner, set);
 	return set;
 }
 
@@ -582,6 +596,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..34c7e37 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,51 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/* XXXX: There's no property to identify a wait event set */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
+
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2
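
A side note on how the pieces above fit together: CreateWaitEventSet() now
registers the new set with CurrentResourceOwner, and FreeWaitEventSet()
unregisters it, so a set leaked by an error path gets released at abort
rather than lingering. A caller that wants a short-lived, untracked set can
clear CurrentResourceOwner around the call, as the patched WaitLatchOrSocket()
does. A minimal sketch of that pattern, for illustration only (this helper is
not part of the patch):

#include "postgres.h"
#include "storage/latch.h"
#include "utils/resowner.h"

/*
 * Create a WaitEventSet that is deliberately not tracked by the current
 * resource owner, mirroring the patched WaitLatchOrSocket() above.
 */
static WaitEventSet *
create_untracked_wes(int nevents)
{
	ResourceOwner savedOwner = CurrentResourceOwner;
	WaitEventSet *set;

	CurrentResourceOwner = NULL;	/* skip ResourceOwnerRememberWES() */
	set = CreateWaitEventSet(CurrentMemoryContext, nevents);
	CurrentResourceOwner = savedOwner;

	return set;			/* caller must pair this with FreeWaitEventSet() */
}

With that pairing in place, the warning in PrintWESLeakWarning() fires only
for sets that are still registered at commit, i.e. genuine leaks.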

0004-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 72cd861c84a9d5bc214a58ce4c9052e52e2a2213 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 16:00:56 +0900
Subject: [PATCH 04/13] Make postgres_fdw async-capable

---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  64 ++--
 contrib/postgres_fdw/postgres_fdw.c            | 483 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 src/backend/executor/execProcnode.c            |   9 +
 src/include/foreign/fdwapi.h                   |   2 +
 7 files changed, 510 insertions(+), 133 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 7f7a744..64cc057 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocate it with
+ * initsize bytes if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 9180afe..bfa2211 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6254,12 +6254,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | aaaa
  a        | aaaaa
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6282,12 +6282,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | bbb
- b        | bbbb
- b        | bbbbb
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | bbb
+ b        | bbbb
+ b        | bbbbb
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6310,12 +6310,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | new
- b        | new
- b        | new
  a        | aaa
  a        | zzzzzz
  a        | zzzzzz
+ b        | new
+ b        | new
+ b        | new
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6338,12 +6338,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- b        | newtoo
- b        | newtoo
- b        | newtoo
  a        | newtoo
  a        | newtoo
  a        | newtoo
+ b        | newtoo
+ b        | newtoo
+ b        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6431,9 +6431,9 @@ select * from bar where f1 in (select f1 from foo) for update;
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6468,9 +6468,9 @@ select * from bar where f1 in (select f1 from foo) for share;
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
+  1 | 11
   3 | 33
   4 | 44
-  1 | 11
   2 | 22
 (4 rows)
 
@@ -6733,27 +6733,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
-  2 | 322
   1 | 311
-  6 | 266
+  2 | 322
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index abb256b..a52d54a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -35,6 +35,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -54,6 +55,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -123,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -137,7 +158,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;	/* true if a result is ready to return */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -153,6 +174,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+									 * list.  Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -166,11 +194,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -193,6 +221,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -291,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -355,8 +385,8 @@ static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
 static bool postgresForeignAsyncConfigureWait(EState *estate,
-								  PendingAsyncRequest *areq,
-								  bool reinit);
+						    PendingAsyncRequest *areq,
+						    bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
 						   PendingAsyncRequest *areq);
 
@@ -379,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -444,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -1335,12 +1369,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1396,32 +1439,126 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting for this node on the same connection, let
+			 * the first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Someone else is holding this connection.  Add myself to the
+			 * tail of the waiters' list, then return not-ready.  To avoid
+			 * scanning through the waiters' list, the current owner maintains
+			 * a shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node in the async-waiting node list */
+			Assert(!fsstate->async_waiting);
+			fsstate->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node in the async-waiting node list */
+			next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1437,7 +1574,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1445,6 +1582,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1473,9 +1613,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1493,7 +1633,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1501,16 +1641,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Discard async state and absorb any remaining result on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1712,7 +1868,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1791,6 +1949,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1801,14 +1961,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1816,10 +1976,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1857,6 +2017,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1877,14 +2039,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1892,10 +2054,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1933,6 +2095,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1953,14 +2117,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1968,10 +2132,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2018,16 +2182,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2307,7 +2471,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2360,7 +2526,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2407,8 +2576,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2527,6 +2696,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2570,6 +2740,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2924,11 +3104,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2994,47 +3174,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Fetch the result of the query previously sent on the node's connection.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while (fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3044,27 +3273,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+		PGresult *res;
+
+		/* Drain and discard all pending results of the running query */
+		while ((res = PQgetResult(conn)) != NULL)
+			PQclear(res);
+		fsstate->s.connspec->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3148,7 +3432,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3158,12 +3442,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3171,9 +3455,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3304,9 +3588,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3314,10 +3598,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4463,8 +4747,10 @@ postgresIsForeignPathAsyncCapable(ForeignPath *path)
 }
 
 /*
- * XXX. Just for testing purposes, let's run everything through the async
- * mechanism but return tuples synchronously.
+ * Accept an async request.  Notify the caller if the next tuple is
+ * immediately available.  ExecForeignScan does additional work to finish the
+ * returned tuple, so call it instead of postgresIterateForeignScan to acquire
+ * a tuple in the expected shape.
  */
 static void
 postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
@@ -4473,22 +4759,59 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
 	slot = ExecForeignScan(node);
-	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
 }
 
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when the node is the connection owner.  Otherwise
+ * another node sharing the connection owns it and will add the event.
+ */
 static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
-								  bool reinit)
+						   bool reinit)
 {
-	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
 	return false;
 }
 
+/*
+ * Process a notification from the async mechanism.  ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
 static void
 postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
 {
-	elog(ERROR, "postgresForeignAsyncNotify");
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
 /*
@@ -4848,7 +5171,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..b3ac615 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 56b01d0..3f83b72 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1575,8 +1575,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 0dd95c6..1cba31e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -115,6 +115,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -820,6 +821,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+			{
+				ForeignScanState *fsstate = (ForeignScanState *) node;
+				FdwRoutine *fdwroutine = fsstate->fdwroutine;
+				if (fdwroutine->ShutdownForeignScan)
+					fdwroutine->ShutdownForeignScan(fsstate);
+			}
+			break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 41fc76f..11c3434 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -163,6 +163,7 @@ typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
 											PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -239,6 +240,7 @@ typedef struct FdwRoutine
 	ForeignAsyncRequest_function ForeignAsyncRequest;
 	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
 	ForeignAsyncNotify_function ForeignAsyncNotify;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
2.9.2
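
Before the next patch, a note on the trickiest piece of the postgres_fdw
changes above: the hand-over of a shared connection between scan nodes in
postgresIterateForeignScan(). Stripped of the FDW details, the queue
discipline reduces to the sketch below (simplified, standalone names; this is
an illustration, not code from the patch):

#include <stddef.h>

/*
 * One scan node "owns" the connection while its FETCH is in flight; later
 * arrivals queue behind it.  last_waiter is maintained only by the current
 * owner, so enqueueing is O(1) without walking the list.  An idle node's
 * last_waiter points at the node itself.
 */
typedef struct ScanNode
{
	struct ScanNode *waiter;		/* next node waiting for the connection */
	struct ScanNode *last_waiter;	/* tail shortcut; valid on the owner */
} ScanNode;

static void
enqueue_waiter(ScanNode *owner, ScanNode *me)
{
	owner->last_waiter->waiter = me;	/* append at the tail */
	owner->last_waiter = me;			/* owner tracks the new tail */
}

static ScanNode *
hand_over(ScanNode *owner)
{
	ScanNode *next = owner->waiter;

	if (next == NULL)
		return NULL;				/* nobody is waiting */

	next->last_waiter = owner->last_waiter;	/* tail moves with ownership */
	owner->waiter = NULL;
	owner->last_waiter = owner;		/* back to the idle state */
	return next;					/* the new owner sends its own FETCH */
}

vacate_connection() breaks this chain whenever another operation (an insert,
a direct modify, a remote estimate) needs the connection: the current owner's
pending result is absorbed first, and each queued node then simply re-issues
its FETCH the next time it is iterated.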

0003-Modify-async-execution-infrastructure.patch (text/x-patch; charset=us-ascii)
From 52aa13caddd7c4da68784f9a4cd58dc635062ca9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 17 Oct 2016 15:54:32 +0900
Subject: [PATCH 03/13] Modify async execution infrastructure.

---
 contrib/postgres_fdw/expected/postgres_fdw.out |  68 ++++++++--------
 contrib/postgres_fdw/postgres_fdw.c            |   5 +-
 src/backend/executor/execAsync.c               | 105 ++++++++++++++-----------
 src/backend/executor/nodeAppend.c              |  50 ++++++------
 src/backend/executor/nodeForeignscan.c         |   4 +-
 src/backend/nodes/copyfuncs.c                  |   1 +
 src/backend/nodes/outfuncs.c                   |   1 +
 src/backend/nodes/readfuncs.c                  |   1 +
 src/backend/optimizer/plan/createplan.c        |  24 +++++-
 src/backend/utils/adt/ruleutils.c              |   6 +-
 src/include/executor/nodeForeignscan.h         |   2 +-
 src/include/foreign/fdwapi.h                   |   2 +-
 src/include/nodes/execnodes.h                  |  10 ++-
 src/include/nodes/plannodes.h                  |   1 +
 14 files changed, 167 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index df22beb..9180afe 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6402,13 +6402,13 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6416,10 +6416,10 @@ select * from bar where f1 in (select f1 from foo) for update;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6439,13 +6439,13 @@ select * from bar where f1 in (select f1 from foo) for update;
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                                       QUERY PLAN                                                       
-------------------------------------------------------------------------------------------------------------------------
+                                          QUERY PLAN                                          
+----------------------------------------------------------------------------------------------
  LockRows
-   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
    ->  Hash Join
-         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Append
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
@@ -6453,10 +6453,10 @@ select * from bar where f1 in (select f1 from foo) for share;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6477,22 +6477,22 @@ select * from bar where f1 in (select f1 from foo) for share;
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                               QUERY PLAN                                                
----------------------------------------------------------------------------------------------------------
+                                         QUERY PLAN                                          
+---------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar.f1 = foo2.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar.f1 = foo.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6500,16 +6500,16 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
-         Hash Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
+         Hash Cond: (bar2.f1 = foo.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                ->  HashAggregate
-                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
-                     Group Key: foo2.f1
+                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                     Group Key: foo.f1
                      ->  Append
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
@@ -6543,8 +6543,8 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
-         Hash Cond: (foo2.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
+         Hash Cond: (foo.f1 = bar.f1)
          ->  Append
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
@@ -6561,8 +6561,8 @@ where bar.f1 = ss.f1;
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
-         Merge Cond: (bar2.f1 = foo2.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
+         Merge Cond: (bar2.f1 = foo.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6570,8 +6570,8 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo2.f1)), foo2.f1
-               Sort Key: foo2.f1
+               Output: (ROW(foo.f1)), foo.f1
+               Sort Key: foo.f1
                ->  Append
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index f180838..abb256b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -354,7 +354,7 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
 static void postgresForeignAsyncRequest(EState *estate,
 							PendingAsyncRequest *areq);
-static void postgresForeignAsyncConfigureWait(EState *estate,
+static bool postgresForeignAsyncConfigureWait(EState *estate,
 								  PendingAsyncRequest *areq,
 								  bool reinit);
 static void postgresForeignAsyncNotify(EState *estate,
@@ -4477,11 +4477,12 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
-static void
+static bool
 postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 								  bool reinit)
 {
 	elog(ERROR, "postgresForeignAsyncConfigureWait");
+	return false;
 }
 
 static void
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index e070c26..33496a9 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -22,7 +22,7 @@
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
-static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 	bool reinit);
 static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
 static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
@@ -43,7 +43,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 				 PlanState *requestee)
 {
 	PendingAsyncRequest *areq = NULL;
-	int		i = estate->es_num_pending_async;
+	int		nasync = estate->es_num_pending_async;
 
 	/*
 	 * If the number of pending asynchronous nodes exceeds the number of
@@ -51,7 +51,7 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * We start with 16 slots, and thereafter double the array size each
 	 * time we run out of slots.
 	 */
-	if (i >= estate->es_max_pending_async)
+	if (nasync >= estate->es_max_pending_async)
 	{
 		int	newmax;
 
@@ -81,25 +81,28 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
 	 * one.
 	 */
-	if (estate->es_pending_async[i] == NULL)
+	if (estate->es_pending_async[nasync] == NULL)
 	{
 		areq = MemoryContextAllocZero(estate->es_query_cxt,
 									  sizeof(PendingAsyncRequest));
-		estate->es_pending_async[i] = areq;
+		estate->es_pending_async[nasync] = areq;
 	}
 	else
 	{
-		areq = estate->es_pending_async[i];
+		areq = estate->es_pending_async[nasync];
 		MemSet(areq, 0, sizeof(PendingAsyncRequest));
 	}
-	areq->myindex = estate->es_num_pending_async++;
+	areq->myindex = estate->es_num_pending_async;
 
 	/* Initialize the new request. */
 	areq->requestor = requestor;
 	areq->request_index = request_index;
 	areq->requestee = requestee;
 
-	/* Give the requestee a chance to do whatever it wants. */
+	/*
+	 * Give the requestee a chance to do whatever it wants.
+	 * Request callbacks mark the request ASYNC_COMPLETE if a result is
+	 */
 	switch (nodeTag(requestee))
 	{
 		case T_ForeignScanState:
@@ -110,6 +113,20 @@ ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
 			elog(ERROR, "unrecognized node type: %d",
 				(int) nodeTag(requestee));
 	}
+
+	/*
+	 * If a result is already available, deliver it to the requestor now.
+	 */
+	if (areq->state == ASYNC_COMPLETE)
+	{
+		Assert(areq->result == NULL || IsA(areq->result, TupleTableSlot));
+		ExecAsyncResponse(estate, areq);
+
+		return;
+	}
+
+	/* No result is available yet, so make this node pending. */
+	estate->es_num_pending_async++;
 }
 
 /*
@@ -175,22 +192,19 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		{
 			PendingAsyncRequest *areq = estate->es_pending_async[i];
 
-			/* Skip it if no callback is pending. */
-			if (!areq->callback_pending)
-				continue;
-
-			/*
-			 * Mark it as no longer needing a callback.  We must do this
-			 * before dispatching the callback in case the callback resets
-			 * the flag.
-			 */
-			areq->callback_pending = false;
-			estate->es_async_callback_pending--;
-
-			/* Perform the actual callback; mark the request done if appropriate. */
-			if (!areq->request_complete)
+			/* Run the callback if one is pending. */
+			if (areq->state == ASYNC_CALLBACK_PENDING)
+			{
+				/*
+				 * Mark it as no longer needing a callback.  We must do this
+				 * before dispatching the callback in case the callback resets
+				 * the flag.
+				 */
+				estate->es_async_callback_pending--;
 				ExecAsyncNotify(estate, areq);
-			else
+			}
+
+			if (areq->state == ASYNC_COMPLETE)
 			{
 				any_node_done = true;
 				if (requestor == areq->requestor)
@@ -214,7 +228,7 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 				PendingAsyncRequest *head;
 				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
 
-				if (!tail->callback_pending && tail->request_complete)
+				if (tail->state == ASYNC_COMPLETE)
 					continue;
 				head = estate->es_pending_async[hidx];
 				estate->es_pending_async[tidx] = head;
@@ -247,7 +261,8 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
  * means wait forever, 0 means don't wait at all, and >0 means wait for the
  * indicated number of milliseconds.
  *
- * Returns true if we found some events and false if we timed out.
+ * Returns true if we found some events and false if we timed out or there
+ * was no event to wait for, as when every request completes during setup.
  */
 static bool
 ExecAsyncEventWait(EState *estate, long timeout)
@@ -258,6 +273,7 @@ ExecAsyncEventWait(EState *estate, long timeout)
 	int		n;
 	bool	reinit = false;
 	bool	process_latch_set = false;
+	bool	added = false;
 
 	if (estate->es_wait_event_set == NULL)
 	{
@@ -282,13 +298,16 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		PendingAsyncRequest *areq = estate->es_pending_async[i];
 
 		if (areq->num_fd_events > 0)
-			ExecAsyncConfigureWait(estate, areq, reinit);
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
 	}
 
+	Assert(added);
+
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
 								 occurred_event, EVENT_BUFFER_SIZE,
 								 WAIT_EVENT_ASYNC_WAIT);
+
 	if (noccurred == 0)
 		return false;
 
@@ -312,12 +331,10 @@ ExecAsyncEventWait(EState *estate, long timeout)
 		{
 			PendingAsyncRequest *areq = w->user_data;
 
-			if (!areq->callback_pending)
-			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
-				estate->es_async_callback_pending++;
-			}
+			Assert(areq->state == ASYNC_WAITING);
+
+			areq->state = ASYNC_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
 		}
 	}
 
@@ -333,8 +350,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 			if (areq->wants_process_latch)
 			{
-				Assert(!areq->request_complete);
-				areq->callback_pending = true;
+				Assert(areq->state == ASYNC_WAITING);
+				areq->state = ASYNC_CALLBACK_PENDING;
 			}
 		}
 	}
@@ -352,15 +369,19 @@ ExecAsyncEventWait(EState *estate, long timeout)
  * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
  * and the number of calls should not exceed areq->num_fd_events (as
  * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the node
+ * driver's responsibility to register at least one event for each
+ * requestor.
  */
-static void
+static bool
 ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
 					   bool reinit)
 {
 	switch (nodeTag(areq->requestee))
 	{
 		case T_ForeignScanState:
-			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
 			break;
 		default:
 			elog(ERROR, "unrecognized node type: %d",
@@ -419,6 +440,7 @@ ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
 	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
 	areq->num_fd_events = num_fd_events;
 	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNC_WAITING;
 
 	if (force_reset && estate->es_wait_event_set != NULL)
 	{
@@ -448,17 +470,12 @@ ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
 	 * need a callback to remove registered wait events.  It's not clear
 	 * that we would come out ahead, so use brute force for now.
 	 */
+	Assert(areq->state == ASYNC_IDLE || areq->state == ASYNC_CALLBACK_PENDING);
+
 	if (areq->num_fd_events > 0 || areq->wants_process_latch)
 		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
 
 	/* Save result and mark request as complete. */
 	areq->result = result;
-	areq->request_complete = true;
-
-	/* Make sure this request is flagged for a callback. */
-	if (!areq->callback_pending)
-	{
-		areq->callback_pending = true;
-		estate->es_async_callback_pending++;
-	}
+	areq->state = ASYNC_COMPLETE;
 }
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index e61218a..568fa25 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -229,9 +229,15 @@ ExecAppend(AppendState *node)
 		 */
 		while ((i = bms_first_member(node->as_needrequest)) >= 0)
 		{
-			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
 			node->as_nasyncpending++;
+
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			/* If this request immediately gives a result, take it. */
+			if (node->as_nasyncresult > 0)
+				return node->as_asyncresult[--node->as_nasyncresult];
 		}
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 	}
 
 	for (;;)
@@ -246,32 +252,32 @@ ExecAppend(AppendState *node)
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-			for (;;)
+			while (node->as_nasyncpending > 0)
 			{
-				if (node->as_nasyncpending == 0)
-				{
-					/*
-					 * If there is no asynchronous activity still pending
-					 * and the synchronous activity is also complete, we're
-					 * totally done scanning this node.  Otherwise, we're
-					 * done with the asynchronous stuff but must continue
-					 * scanning the synchronous children.
-					 */
-					if (node->as_syncdone)
-						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
-					break;
-				}
-				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
-				{
-					/* Timeout reached. */
-					break;
-				}
-				if (node->as_nasyncresult > 0)
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
 				{
 					/* Asynchronous subplan returned a tuple! */
 					--node->as_nasyncresult;
 					return node->as_asyncresult[node->as_nasyncresult];
 				}
+
+				/* Timeout reached.  Move on to the sync nodes, if any remain. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 			}
 		}
 
@@ -397,7 +403,7 @@ ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	/* We shouldn't be called until the request is complete. */
-	Assert(areq->request_complete);
+	Assert(areq->state == ASYNC_COMPLETE);
 
 	/* Our result slot shouldn't already be occupied. */
 	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 61899d1..85dad79 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -376,7 +376,7 @@ ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
  *		In async mode, configure for a wait
  * ----------------------------------------------------------------
  */
-void
+bool
 ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit)
 {
@@ -384,7 +384,7 @@ ExecAsyncForeignScanConfigureWait(EState *estate,
 	FdwRoutine *fdwroutine = node->fdwroutine;
 
 	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
-	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a8cabdf..c62aaf2 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -237,6 +237,7 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(appendplans);
 	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a894a9d..c2e34a8 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -370,6 +370,7 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(appendplans);
 	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 67439ec..9837eff 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1540,6 +1540,7 @@ _readAppend(void)
 
 	READ_NODE_FIELD(appendplans);
 	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 2140094..0575541 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+						   int referent, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -966,6 +967,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	List	   *syncplans = NIL;
 	ListCell   *subpaths;
 	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -991,7 +994,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * deparsing tlist entries (see set_deparse_planstate).  Since async
+	 * children are moved to the head of the subplan list, record where the
+	 * first child of best_path->subpaths ends up so it can stay the referent.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -1005,9 +1015,13 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		{
 			asyncplans = lappend(asyncplans, subplan);
 			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
 		}
 		else
 			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1017,7 +1031,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5019,7 +5034,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int nasyncplans, List *tlist)
+make_append(List *appendplans, int nasyncplans,	int referent, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5030,6 +5045,7 @@ make_append(List *appendplans, int nasyncplans, List *tlist)
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
 	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
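
(A quick worked example of the referent bookkeeping above: if
best_path->subpaths is (sync_a, async_b, async_c), make_append receives the
reordered list (async_b, async_c, sync_a) with nasyncplans = 2, and because
the original first child is synchronous, referent becomes 2, the slot where
sync_a now sits, so EXPLAIN keeps deparsing against the original first child.)
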
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index f355954..76dd07a 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4242,7 +4242,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int		idx = ((Append *) ps->plan)->referent;
+
+		dpns->outer_planstate = ((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 5a61306..2d9a62b 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -31,7 +31,7 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 
 extern void ExecAsyncForeignScanRequest(EState *estate,
 	PendingAsyncRequest *areq);
-extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
 	PendingAsyncRequest *areq, bool reinit);
 extern void ExecAsyncForeignScanNotify(EState *estate,
 	PendingAsyncRequest *areq);
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 4c50f1e..41fc76f 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -158,7 +158,7 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
 typedef void (*ForeignAsyncRequest_function) (EState *estate,
 											PendingAsyncRequest *areq);
-typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
 											PendingAsyncRequest *areq,
 											bool reinit);
 typedef void (*ForeignAsyncNotify_function) (EState *estate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 81e997e..5afcd34 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -361,6 +361,13 @@ typedef struct ResultRelInfo
  * State for an asynchronous tuple request.
  * ----------------
  */
+typedef enum AsyncRequestState
+{
+	ASYNC_IDLE,
+	ASYNC_WAITING,
+	ASYNC_CALLBACK_PENDING,
+	ASYNC_COMPLETE
+} AsyncRequestState;
 typedef struct PendingAsyncRequest
 {
 	int			myindex;			/* Index in es_pending_async. */
@@ -369,8 +376,7 @@ typedef struct PendingAsyncRequest
 	int			request_index;	/* Scratch space for requestor. */
 	int			num_fd_events;	/* Max number of FD events requestee needs. */
 	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
-	bool		callback_pending;			/* Callback is needed. */
-	bool		request_complete;		/* Request complete, result valid. */
+	AsyncRequestState state;
 	Node	   *result;			/* Result (NULL if no more tuples). */
 } PendingAsyncRequest;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f0daada..ebbc78d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -229,6 +229,7 @@ typedef struct Append
 	Plan		plan;
 	List	   *appendplans;
 	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
-- 
2.9.2

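To make the reworked control flow easier to review, here is the
requestor-side pattern the nodeAppend changes above boil down to, condensed
into one hypothetical helper.  This is a reading aid only, not part of the
patch; the function name is invented, and rescan, instrumentation, and error
paths are omitted.

#include "postgres.h"

#include "executor/execAsync.h"
#include "executor/nodeAppend.h"
#include "nodes/bitmapset.h"

/* Sketch: how an Append-like requestor drives the async API. */
static TupleTableSlot *
fetch_async_tuple(AppendState *node)
{
	EState	   *estate = node->ps.state;
	int			i;

	/* Issue a request to every child that needs one... */
	while ((i = bms_first_member(node->as_needrequest)) >= 0)
	{
		node->as_nasyncpending++;
		ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);

		/* ...and take any result that was produced on the spot. */
		if (node->as_nasyncresult > 0)
			return node->as_asyncresult[--node->as_nasyncresult];
	}

	while (node->as_nasyncpending > 0)
	{
		/* Poll (0) while sync children remain; block (-1) once done. */
		long		timeout = node->as_syncdone ? -1 : 0;

		if (ExecAsyncEventLoop(estate, &node->ps, timeout) &&
			node->as_nasyncresult > 0)
			return node->as_asyncresult[--node->as_nasyncresult];

		if (!node->as_syncdone)
			break;				/* give the synchronous children a turn */
	}

	return NULL;				/* caller falls back to the sync children */
}

The two timeout values mirror the policy in ExecAppend: poll without blocking
while synchronous children can still make progress, and block indefinitely
once only asynchronous work remains.
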
0002-Fix-some-bugs.patch (text/x-patch; charset=us-ascii)
From 9af6f95965adf04e713235f541919158512ae994 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 14:03:53 +0900
Subject: [PATCH 02/13] Fix some bugs.

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 142 ++++++++++++-------------
 contrib/postgres_fdw/postgres_fdw.c            |   3 +-
 src/backend/executor/execAsync.c               |   4 +-
 src/backend/postmaster/pgstat.c                |   3 +
 src/include/pgstat.h                           |   3 +-
 5 files changed, 81 insertions(+), 74 deletions(-)

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0b9e3e4..df22beb 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6254,12 +6254,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- a        | aaa
- a        | aaaa
- a        | aaaaa
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | aaaa
+ a        | aaaaa
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6282,12 +6282,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6310,12 +6310,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | new
  b        | new
  b        | new
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6338,12 +6338,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | newtoo
- a        | newtoo
- a        | newtoo
  b        | newtoo
  b        | newtoo
  b        | newtoo
+ a        | newtoo
+ a        | newtoo
+ a        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6402,120 +6402,120 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 explain (verbose, costs off)
 select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+                                                       QUERY PLAN                                                       
+------------------------------------------------------------------------------------------------------------------------
  LockRows
-   Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
+   Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
    ->  Hash Join
-         Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar2.f1, bar2.f2, bar2.ctid, ((bar2.*)::bar), bar2.tableoid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Append
-               ->  Seq Scan on public.bar
-                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
+               ->  Seq Scan on public.bar
+                     Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (22 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
 ----+----
-  1 | 11
-  2 | 22
   3 | 33
   4 | 44
+  1 | 11
+  2 | 22
 (4 rows)
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
-                                         QUERY PLAN                                          
----------------------------------------------------------------------------------------------
+                                               QUERY PLAN                                                
+---------------------------------------------------------------------------------------------------------
  Update on public.bar
    Update on public.bar
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar.f1 = foo2.f1)
          ->  Seq Scan on public.bar
                Output: bar.f1, bar.f2, bar.ctid
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo2.ctid, ((foo2.*)::foo), foo2.tableoid
+         Hash Cond: (bar2.f1 = foo2.f1)
          ->  Foreign Scan on public.bar2
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Hash
-               Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
                ->  HashAggregate
-                     Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                     Group Key: foo.f1
+                     Output: foo2.ctid, ((foo2.*)::foo), foo2.tableoid, foo2.f1
+                     Group Key: foo2.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6543,26 +6543,26 @@ where bar.f1 = ss.f1;
    Foreign Update on public.bar2
      Remote SQL: UPDATE public.loct2 SET f2 = $2 WHERE ctid = $1
    ->  Hash Join
-         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
-         Hash Cond: (foo.f1 = bar.f1)
+         Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo2.f1))
+         Hash Cond: (foo2.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
                      Output: bar.f1, bar.f2, bar.ctid
    ->  Merge Join
-         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo.f1))
-         Merge Cond: (bar2.f1 = foo.f1)
+         Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, (ROW(foo2.f1))
+         Merge Cond: (bar2.f1 = foo2.f1)
          ->  Sort
                Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                Sort Key: bar2.f1
@@ -6570,19 +6570,19 @@ where bar.f1 = ss.f1;
                      Output: bar2.f1, bar2.f2, bar2.f3, bar2.ctid
                      Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
          ->  Sort
-               Output: (ROW(foo.f1)), foo.f1
-               Sort Key: foo.f1
+               Output: (ROW(foo2.f1)), foo2.f1
+               Sort Key: foo2.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6749,8 +6749,8 @@ update bar set f2 = f2 + 100 returning *;
 update bar set f2 = f2 + 100 returning *;
  f1 | f2  
 ----+-----
-  1 | 311
   2 | 322
+  1 | 311
   6 | 266
   3 | 333
   4 | 344
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 595a47e..f180838 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,7 @@
 #include "commands/explain.h"
 #include "commands/vacuum.h"
 #include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -4472,7 +4473,7 @@ postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
 	TupleTableSlot *slot;
 
 	Assert(IsA(node, ForeignScanState));
-	slot = postgresIterateForeignScan(node);
+	slot = ExecForeignScan(node);
 	ExecAsyncRequestDone(estate, areq, (Node *) slot);
 }
 
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 5858bb5..e070c26 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -18,6 +18,7 @@
 #include "executor/nodeAppend.h"
 #include "executor/nodeForeignscan.h"
 #include "miscadmin.h"
+#include "pgstat.h"
 #include "storage/latch.h"
 
 static bool ExecAsyncEventWait(EState *estate, long timeout);
@@ -286,7 +287,8 @@ ExecAsyncEventWait(EState *estate, long timeout)
 
 	/* Wait for at least one event to occur. */
 	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
-								 occurred_event, EVENT_BUFFER_SIZE);
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
 	if (noccurred == 0)
 		return false;
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7176cf1..af59f51 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3398,6 +3398,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index de8225b..7769d3c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -787,7 +787,8 @@ typedef enum
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

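Both here and in the framework patch below, the postgres_fdw async callbacks
are still placeholders (ForeignAsyncConfigureWait and ForeignAsyncNotify just
elog).  For anyone prototyping a real async FDW against this interface, a
minimal sketch of a working configure-wait callback might look like the
following; it assumes the bool-returning signature from the patch at the top
of this mail, and everything named Example* - including keeping a PGconn in
fdw_state - is invented for illustration, not something these patches provide.

#include "postgres.h"

#include "executor/execAsync.h"
#include "nodes/execnodes.h"
#include "storage/latch.h"

#include "libpq-fe.h"

/* Hypothetical FDW-private scan state; not part of these patches. */
typedef struct ExampleFdwScanState
{
	PGconn	   *conn;			/* connection whose socket we wait on */
} ExampleFdwScanState;

static bool
exampleForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
								 bool reinit)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;
	ExampleFdwScanState *fsstate = (ExampleFdwScanState *) node->fdw_state;

	/*
	 * reinit says the WaitEventSet was just (re)built, so any event we had
	 * registered is gone and must be added again.  This is the one event
	 * promised earlier via ExecAsyncSetRequiredEvents(estate, areq, 1, ...).
	 */
	if (reinit)
		AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
						  PQsocket(fsstate->conn), NULL, areq);

	return true;				/* at least one event is registered */
}
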
0001-robert-s-2nd-framework.patch (text/x-patch; charset=us-ascii)
From 6ae1a77eaa324fe4455840ddbeb734bd12bc4ede Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 12 Oct 2016 12:46:10 +0900
Subject: [PATCH 01/13] robert's 2nd framework

---
 contrib/postgres_fdw/postgres_fdw.c     |  49 ++++
 src/backend/executor/Makefile           |   4 +-
 src/backend/executor/README             |  43 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 462 ++++++++++++++++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 162 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 ++++
 src/backend/nodes/copyfuncs.c           |   1 +
 src/backend/nodes/outfuncs.c            |   1 +
 src/backend/nodes/readfuncs.c           |   1 +
 src/backend/optimizer/plan/createplan.c |  45 +++-
 src/include/executor/execAsync.h        |  29 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  15 ++
 src/include/nodes/execnodes.h           |  57 +++-
 src/include/nodes/plannodes.h           |   1 +
 17 files changed, 909 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5d270b9..595a47e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -349,6 +350,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static void postgresForeignAsyncConfigureWait(EState *estate,
+								  PendingAsyncRequest *areq,
+								  bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -468,6 +477,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -4440,6 +4455,40 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * XXX. Just for testing purposes, let's run everything through the async
+ * mechanism but return tuples synchronously.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = postgresIterateForeignScan(node);
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
+static void
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+								  bool reinit)
+{
+	elog(ERROR, "postgresForeignAsyncConfigureWait");
+}
+
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	elog(ERROR, "postgresForeignAsyncNotify");
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 2a2b7eb..dd05d1e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+	   execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execReplication.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..1dee3db 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,46 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from an
+async-capable child node using ExecAsyncRequest.  Next, it must execute
+the asynchronous event loop using ExecAsyncEventLoop; it can avoid giving
+up control indefinitely by passing a timeout to this function, even passing
+0 to poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting node
+will receive a callback from the event loop via ExecAsyncResponse. Typically,
+the ExecAsyncResponse callback is the only one required for nodes that wish
+to request tuples asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
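
To make method 1 above concrete, here is a sketch of a request callback for
an imaginary FDW.  It reuses the hypothetical ExampleFdwScanState from the
earlier sketch, and the three helper functions are stand-ins for whatever the
driver would actually do, not real API:

/* Hypothetical driver helpers; assumed, not provided by any patch here. */
static bool conn_has_buffered_tuple(ExampleFdwScanState *fsstate);
static TupleTableSlot *fetch_buffered_tuple(ExampleFdwScanState *fsstate,
											ForeignScanState *node);
static void start_nonblocking_fetch(ExampleFdwScanState *fsstate);

/* Sketch of an ExecAsyncRequest-time callback (method 1 above). */
static void
exampleForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
{
	ForeignScanState *node = (ForeignScanState *) areq->requestee;
	ExampleFdwScanState *fsstate = (ExampleFdwScanState *) node->fdw_state;

	if (conn_has_buffered_tuple(fsstate))
	{
		/* A row is already available: answer without waiting. */
		TupleTableSlot *slot = fetch_buffered_tuple(fsstate, node);

		ExecAsyncRequestDone(estate, areq, (Node *) slot);
		return;
	}

	/* Otherwise start the remote fetch and wait on one socket event. */
	start_nonblocking_fetch(fsstate);
	ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}
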
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index d380207..e154c59 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -468,11 +468,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..5858bb5
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,462 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "storage/latch.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static void ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		i = estate->es_num_pending_async;
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (i >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[i] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[i] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[i];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async++;
+
+	/* Initialize the new request. */
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/*
+		 * Check for events, but don't block if there are notifications that
+		 * have not been delivered yet.
+		 */
+		if (estate->es_async_callback_pending > 0)
+			ExecAsyncEventWait(estate, 0);
+		else if (!ExecAsyncEventWait(estate, cur_timeout))
+			cur_timeout = 0;			/* Timeout was reached. */
+		else if (timeout > 0)
+		{
+			instr_time      cur_time;
+
+			/* Recompute the remaining timeout; updates the outer variable. */
+			INSTR_TIME_SET_CURRENT(cur_time);
+			INSTR_TIME_SUBTRACT(cur_time, start_time);
+			cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+			if (cur_timeout < 0)
+				cur_timeout = 0;
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			/* Skip it if no callback is pending. */
+			if (!areq->callback_pending)
+				continue;
+
+			/*
+			 * Mark it as no longer needing a callback.  We must do this
+			 * before dispatching the callback in case the callback resets
+			 * the flag.
+			 */
+			areq->callback_pending = false;
+			estate->es_async_callback_pending--;
+
+			/* Perform the actual callback; mark the request done if appropriate. */
+			if (!areq->request_complete)
+				ExecAsyncNotify(estate, areq);
+			else
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+			}
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all non-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				if (!tail->callback_pending && tail->request_complete)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; to the contrary,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if we found some events and false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+		estate->es_wait_event_set =
+			CreateWaitEventSet(estate->es_query_cxt,
+							   estate->es_allocated_fd_events + 1);
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0)
+			ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE);
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			if (!areq->callback_pending)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+				estate->es_async_callback_pending++;
+			}
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(!areq->request_complete);
+				areq->callback_pending = true;
+			}
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ */
+static void
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events
+ * should be the maximum number of file descriptor events that it will wish to
+ * register.  force_reset should be true if the node can't reuse the
+ * WaitEventSet it most recently initialized, for example because it needs to
+ * drop a wait event from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->request_complete = true;
+
+	/* Make sure this request is flagged for a callback. */
+	if (!areq->callback_pending)
+	{
+		areq->callback_pending = true;
+		estate->es_async_callback_pending++;
+	}
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6986cae..e61218a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up nodes that are scanned
+	 * synchronously, so the first plan we can scan is given by as_nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, all async subplans need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,78 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * If there are any nodes that need a new asynchronous request,
+		 * make all of them.
+		 */
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+			node->as_nasyncpending++;
+		}
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			for (;;)
+			{
+				if (node->as_nasyncpending == 0)
+				{
+					/*
+					 * If there is no asynchronous activity still pending
+					 * and the synchronous activity is also complete, we're
+					 * totally done scanning this node.  Otherwise, we're
+					 * done with the asynchronous stuff but must continue
+					 * scanning the synchronous children.
+					 */
+					if (node->as_syncdone)
+						return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+					break;
+				}
+				if (!ExecAsyncEventLoop(node->ps.state, &node->ps, timeout))
+				{
+					/* Timeout reached. */
+					break;
+				}
+				if (node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
+		 */
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +299,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +352,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +380,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->request_complete);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* Request is no longer pending. */
+	Assert(node->as_nasyncpending > 0);
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately because it might complete
+	 * synchronously and re-enter this code; see ExecAppend's request loop.
+	 */
+	node->as_needrequest = bms_add_member(node->as_needrequest,
+										  areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 86a77e3..61899d1 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -353,3 +353,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 30d733e..a8cabdf 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -236,6 +236,7 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1560ac3..a894a9d 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -369,6 +369,7 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index dcfa6ee..67439ec 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1539,6 +1539,7 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 997bdcf..2140094 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,7 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -272,6 +272,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -961,8 +962,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -997,7 +1000,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
 	}
 
 	/*
@@ -1007,7 +1017,7 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5009,7 +5019,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5019,6 +5029,7 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
 
 	return node;
 }
@@ -6340,3 +6351,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..2abc32d
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,29 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index f0e942a..5a61306 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern void ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 523d415..4c50f1e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -224,6 +233,12 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
 } FdwRoutine;
 
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 42c6c58..81e997e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -356,6 +356,25 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	bool		callback_pending;			/* Callback is needed. */
+	bool		request_complete;		/* Request complete, result valid. */
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -434,6 +453,31 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area   *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_async_callback_pending is the number of PendingAsyncRequests for
+	 * which callback_pending is true.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;
+	int			es_max_pending_async;
+	int			es_async_callback_pending;
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1179,17 +1223,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f72f7a8..f0daada 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -228,6 +228,7 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
 } Append;
 
 /* ----------------
-- 
2.9.2

#24Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#23)
4 attachment(s)

Hello, I totally reorganized the patch set into four patches on the
current master (9e43e87).

At Wed, 22 Feb 2017 17:39:45 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170222.173945.262776579.horiguchi.kyotaro@lab.ntt.co.jp>

Finally, I couldn't see the crash for the (maybe) same case. I
can guess two reasons for this. One is a situation where
node->as_nasyncpending differs from estate->es_num_pending_async,
but I couldn't find how that could happen. Another is a situation
in postgresIterateForeignScan where the "next owner" reaches EOF
but another waiter does not. I haven't reproduced that situation
but fixed the code for the case. In addition, I found a bug in
ExecAsyncAppendResponse: it calls bms_add_member in an
inappropriate way, discarding the returned set.
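
For reference, bms_add_member() may enlarge the Bitmapset and return a
different pointer, so its result must always be assigned back. A minimal
sketch of the fix, using the names from the WIP patch:

    /* Wrong: the possibly-reallocated set is discarded. */
    bms_add_member(node->as_needrequest, areq->request_index);

    /* Right: keep the pointer that bms_add_member() returns. */
    node->as_needrequest = bms_add_member(node->as_needrequest,
                                          areq->request_index);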

This turned out to be wrong. The real problem here was (maybe) that
ExecAsyncRequest can complete a tuple immediately. This causes
ExecAsyncRequest to be called multiple times for the same child at
once. (In that case, the node being processed is added back to
node->as_needrequest before ExecAsyncRequest returns.)

Using a copy of node->as_needrequest would fix this, but that is
fragile, so I changed ExecAsyncRequest not to return a tuple
immediately. Instead, ExecAsyncEventLoop skips waiting if there is
no node to wait for. The tuple previously delivered as a "response"
within ExecAsyncRequest is now delivered there.
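
To illustrate the hazard, here is the request loop from the earlier WIP
version of ExecAppend. bms_first_member() consumes members of
as_needrequest, but if ExecAsyncRequest() completes immediately and
re-adds index i, the same child is picked up again within the same pass:

    while ((i = bms_first_member(node->as_needrequest)) >= 0)
    {
        ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
        node->as_nasyncpending++;
    }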

In addition, the current policy of preserving es_wait_event_set
doesn't seem to work with the async-capable postgres_fdw, so the
current code clears it on every entry to ExecAppend. This needs
more thought.
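
A minimal sketch of that workaround, assuming the clearing is done at
the top of ExecAppend with the ExecAsyncClearEvents() helper from patch
0002 (the exact call site is an assumption here):

    /* Assumed placement: drop any stale wait event set before issuing
     * new async requests, at the cost of re-registering wait events. */
    if (node->as_nasyncplans > 0)
        ExecAsyncClearEvents(estate);  /* frees and NULLs es_wait_event_set */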

I measured the performance of async execution, and it has improved
considerably since the previous version, especially in the
single-connection environment.

pf0: 4 foreign tables on a single connection
non async : (prev) 7928ms -> (this time)7993ms
async : (prev) 6447ms -> (this time)3211ms

pf1: 4 foreign tables, each on its own dedicated connection
non async : (prev) 8023ms -> (this time)7953ms
async : (prev) 1876ms -> (this time)1841ms

Async execution reduces the run time by 60% in the single-connection
environment and by 77% in the dedicated-connection environment.
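
(Those rates are presumably computed from the timings above:
(7993 - 3211) / 7993 ≈ 0.60 and (7953 - 1841) / 7953 ≈ 0.77.)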

Mmm, I reproduced it quite easily. A silly bug.

Something bad is happening in the interaction between freeing the
ExecutorState memory context and the resource owner. Perhaps the
ExecutorState is freed by the resowner (as part of its ancestors)
before the memory for the WaitEventSet is freed. It was careless of
me. I'll reconsider it.

The cause was that the WaitEventSet was placed in ExecutorState
but registered to TopTransactionResourceOwner. I fixed it.
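
For reference, the allocation in patch 0002 below now ties both the
memory and the ownership to the transaction, so error cleanup releases
the set exactly once:

    estate->es_wait_event_set =
        CreateWaitEventSet(TopTransactionContext,
                           TopTransactionResourceOwner,
                           estate->es_allocated_fd_events + 1);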

The attached patches are as follows.

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
Allows a WaitEventSet to be released by a resource owner.

0002-Asynchronous-execution-framework.patch
The asynchronous execution framework, based on Robert's version.
All edits on it have been merged.

0003-Make-postgres_fdw-async-capable.patch
Makes postgres_fdw async-capable.

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch

This could be merged into 0002, but I kept it separate since the
use of these pragmas is arguable.
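
Since only the patch title is visible here, the following is a
hypothetical sketch of what 0004 does, assuming the likely()/unlikely()
macros from c.h and a branch near the top of ExecAppend:

    /* Hypothetical: annotate the async branch as rare so the purely
     * synchronous route through ExecAppend stays cheap. */
    if (unlikely(node->as_nasyncplans > 0))
    {
        /* async request and event-loop handling, as in patch 0002 */
    }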

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From bcd888a98a7aa5e1bd367c83e06d598121fd2d94 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner

WaitEventSet needs to be released by a resource owner in certain
cases. This change adds resource owner support for WaitEventSets and
allows the creator of a WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 7939b1f..16a5d7a 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 0079ba5..a204b0c 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -62,6 +62,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -90,6 +91,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -324,7 +327,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -482,12 +485,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -547,6 +553,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -582,6 +593,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 6f1ef0b..503aef1 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  &MyProc->procLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+	 * use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+	 * use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2

0002-Asynchronous-execution-framework.patchtext/x-patch; charset=us-asciiDownload
From 2f90dd114467c5da10b8e3bdaa20ccef47052a15 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/5] Asynchronous execution framework

This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), handling multiple inputs within a
single backend process. To avoid degrading non-async execution, the
framework uses a completely separate channel to convey tuples. The
details of the API are described at the end of
src/backend/executor/README.
---
 src/backend/executor/Makefile           |   4 +-
 src/backend/executor/README             |  45 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 520 ++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c     |   9 +
 src/backend/executor/instrument.c       |   2 +-
 src/backend/executor/nodeAppend.c       | 169 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 +++
 src/backend/nodes/copyfuncs.c           |   2 +
 src/backend/nodes/outfuncs.c            |   2 +
 src/backend/nodes/readfuncs.c           |   2 +
 src/backend/optimizer/plan/createplan.c |  63 +++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/utils/adt/ruleutils.c       |   6 +-
 src/include/executor/execAsync.h        |  30 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  17 ++
 src/include/nodes/execnodes.h           |  65 +++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 21 files changed, 979 insertions(+), 29 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 2a2b7eb..dd05d1e 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+	   execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execReplication.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..7bd009c 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,48 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest.  Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing 0 to
+poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index d380207..e154c59 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -468,11 +468,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		nasync = estate->es_num_pending_async;
+
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (nasync >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[nasync] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[nasync] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[nasync];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async;
+
+	/* Initialize the new request. */
+	areq->state = ASYNCREQ_IDLE;
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
+
+	/* No result available now, make this node pending */
+	estate->es_num_pending_async++;
+
+	return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	Assert(estate->es_async_callback_pending == 0);
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Check for events only if any node is async-not-ready. */
+		if (estate->es_num_async_ready < estate->es_num_pending_async)
+		{
+			/* Don't block if any tuple available. */
+			if (estate->es_async_callback_pending > 0)
+				ExecAsyncEventWait(estate, 0);
+			else if (!ExecAsyncEventWait(estate, cur_timeout))
+			{	/* Not fired */
+				/* Exited before timeout. Calculate the remaining time. */
+				instr_time      cur_time;
+
+				/* A negative timeout means wait forever. */
+				if (timeout < 0)
+					continue;
+
+				INSTR_TIME_SET_CURRENT(cur_time);
+				INSTR_TIME_SUBTRACT(cur_time, start_time);
+				cur_timeout =
+					timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+
+				if (cur_timeout > 0)
+					continue;
+
+				/* Timed out; let the exit check below see it. */
+				cur_timeout = 0;
+			}
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
+			/* Notify if the requestee is ready */
+			if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+				ExecAsyncNotify(estate, areq);
+
+			/* Deliver the acquired tuple to the requester */
+			if (areq->state == ASYNCREQ_COMPLETE)
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
+			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all non-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+				if (tail->state == ASYNCREQ_COMPLETE)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; instead,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out; returns true if any event fired or there
+ * is no event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+	bool	added = false;
+	bool	fired = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+		/*
+		 * The wait event set created here should be live beyond ExecutorState
+		 * context but released in case of error.
+		 */
+		estate->es_wait_event_set =
+			CreateWaitEventSet(TopTransactionContext,
+							   TopTransactionResourceOwner,
+							   estate->es_allocated_fd_events + 1);
+
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0 || areq->wants_process_latch)
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/*
+	 * We may have no event to wait for. This occurs when all nodes that
+	 * are executing asynchronously have tuples immediately available.
+	 */
+	if (!added)
+		return true;
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			Assert(areq->state == ASYNCREQ_WAITING);
+
+			areq->state = ASYNCREQ_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
+			fired = true;
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(areq->state == ASYNCREQ_WAITING);
+				areq->state = ASYNCREQ_CALLBACK_PENDING;
+				estate->es_async_callback_pending++;
+				fired = true;
+			}
+		}
+	}
+
+	return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * An individual request may omit registering an event, but it is the
+ * node driver's responsibility to register at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+
+	estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+	estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
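+ *
+ * As an illustration (a sketch, not code from this patch): an FDW request
+ * callback that wants to wait on a single socket and not on the process
+ * latch might call
+ *
+ *		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);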
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNCREQ_WAITING;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+		ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	Assert(areq->state == ASYNCREQ_IDLE ||
+		   areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->state = ASYNCREQ_COMPLETE;
+	estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+	if (estate->es_wait_event_set == NULL)
+		return;
+
+	FreeWaitEventSet(estate->es_wait_event_set);
+	estate->es_wait_event_set = NULL;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5ccc2e8..88f823d 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -115,6 +115,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -822,6 +823,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+			{
+				ForeignScanState *fsstate = (ForeignScanState *) node;
+				FdwRoutine *fdwroutine = fsstate->fdwroutine;
+
+				if (fdwroutine->ShutdownForeignScan)
+					fdwroutine->ShutdownForeignScan(fsstate);
+			}
+			break;
 		default:
 			break;
 	}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6986cae..12d3742 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up nodes that are scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, every asynchronous subplan needs a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+
+		/*
+		 * XXX: Always clear the registered events.  This seems a bit
+		 * inefficient, but the set of events to wait for can change
+		 * arbitrarily between calls.
+		 */
+		ExecAsyncClearEvents(estate);
+
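+		/*
+		 * Issue a new request for every child recorded in as_needrequest.
+		 * A request may complete immediately, in which case its result is
+		 * queued in as_asyncresult via ExecAsyncAppendResponse.
+		 */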
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			node->as_nasyncpending++;
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+		}
+
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
+		 */
+		if (node->as_nasyncpending > 0)
+		{
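+			/*
+			 * If all synchronous children are already done we can block
+			 * until an asynchronous result arrives; otherwise just poll,
+			 * so that the synchronous children can make progress.
+			 */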
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			while (node->as_nasyncpending > 0)
+			{
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+
+				/* Timeout reached.  Fall through to the sync nodes, if any. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +306,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +359,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +387,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->state == ASYNCREQ_COMPLETE);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* This is no longer pending */
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another request here immediately; the next call to
+	 * ExecAppend will do that.
+	 */
+	node->as_needrequest =
+		bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 86a77e3..85dad79 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -353,3 +353,52 @@ ExecForeignScanInitializeWorker(ForeignScanState *node, shm_toc *toc)
 		fdwroutine->InitializeWorkerForeignScan(node, toc, coordinate);
 	}
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 05d8538..e64ec77 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -236,6 +236,8 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b3802b4..8b39efa 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -369,6 +369,8 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d2f69fe..d5d3c81 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1539,6 +1539,8 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 1e953b4..72080cb 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -194,7 +194,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *appendplans, int nasyncplans,
+						   int referent, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -272,6 +273,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *resultRelations, List *subplans,
 				 List *withCheckOptionLists, List *returningLists,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -960,8 +962,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -987,7 +993,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * deparsing tlist entries (see set_deparse_planstate), so we must track
+	 * where the first child in best_path->subpaths ends up in the reordered
+	 * subplan list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -996,7 +1009,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1006,7 +1030,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5003,7 +5028,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans, int referent, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5013,6 +5038,8 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6334,3 +6361,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+			break;
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index ada374c..a0ec3b7 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3401,6 +3401,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index b27b77d..c43e8b2 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4240,7 +4240,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int			idx = ((Append *) ((AppendState *) ps)->ps.plan)->referent;
+
+		dpns->outer_planstate = ((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index f0e942a..2d9a62b 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -29,4 +29,11 @@ extern void ExecForeignScanInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 523d415..11c3434 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -155,6 +155,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
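+
+/*
+ * ForeignAsyncRequest asks the FDW to start producing a tuple for areq,
+ * ForeignAsyncConfigureWait lets it register the wait events it needs,
+ * ForeignAsyncNotify reports that one of those events has fired, and
+ * ShutdownForeignScan gives the FDW a chance to clean up a scan that is
+ * being shut down.
+ */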
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -224,6 +234,13 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6332ea0..8445d79 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -356,6 +356,32 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+	ASYNCREQ_IDLE,						/* Nothing is requested */
+	ASYNCREQ_WAITING,					/* Waiting for events */
+	ASYNCREQ_CALLBACK_PENDING,			/* Events occurred; callback pending */
+	ASYNCREQ_COMPLETE					/* Result is available */
+} AsyncRequestState;
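+
+/*
+ * Typical lifecycle: a request starts out ASYNCREQ_IDLE; the requestee
+ * either completes it at once with ExecAsyncRequestDone (ASYNCREQ_COMPLETE)
+ * or registers wait events via ExecAsyncSetRequiredEvents
+ * (ASYNCREQ_WAITING).  When a registered event fires, the event loop marks
+ * the request ASYNCREQ_CALLBACK_PENDING and calls the requestee back,
+ * which ultimately ends in ExecAsyncRequestDone.
+ */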
+
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	AsyncRequestState state;
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -435,6 +461,32 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area   *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently
+	 * valid.  (Entries after that may point to storage that can be reused.)
+	 * es_num_async_ready is the number of PendingAsyncRequests with a tuple
+	 * ready to be retrieved.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;		/* # of valid pending requests */
+	int			es_max_pending_async;		/* allocated size of es_pending_async */
+	int			es_async_callback_pending;	/* # of requests awaiting callback */
+	int			es_num_async_ready;			/* # of requests with a result ready */
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1180,17 +1232,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f72f7a8..ebbc78d 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -228,6 +228,8 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 8b710ec..6c94a75 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -788,7 +788,8 @@ typedef enum
 	WAIT_EVENT_MQ_SEND,
 	WAIT_EVENT_PARALLEL_FINISH,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

0003-Make-postgres_fdw-async-capable.patchtext/x-patch; charset=us-asciiDownload
From bd740f884446b60847c579a6a4c16c7b2d16cf90 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/5] Make postgres_fdw async-capable.

Make postgres_fdw async-capable using the new infrastructure. Additionally,
give each postgres_fdw connection a connection-specific storage area so
that foreign scans on the same connection can share some data; postgres_fdw
uses it to track the scan node currently running a query on the underlying
connection. This allows asynchronous execution of multiple foreign scans
on one foreign server.
---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out | 120 +++---
 contrib/postgres_fdw/postgres_fdw.c            | 522 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  12 +-
 5 files changed, 583 insertions(+), 152 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 7f7a744..64cc057 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Rerturns the connection specific storage for this user. Allocate with
+ * initsize if not exists.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0b9e3e4..90691e5 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6401,34 +6401,39 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6438,34 +6443,39 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6494,11 +6504,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Hash Cond: (bar2.f1 = foo.f1)
@@ -6511,11 +6521,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6546,16 +6556,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6573,16 +6583,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6733,27 +6743,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 5d270b9..76e8437 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -33,6 +35,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -52,6 +55,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -121,10 +127,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Per-connection private state, shared by all scans using the connection.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -135,7 +158,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -151,6 +174,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* Last node in the waiting list.
+									 * Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -164,11 +194,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -191,6 +221,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -289,6 +320,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -349,6 +381,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+							PendingAsyncRequest *areq,
+							bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -369,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -434,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -468,6 +512,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1319,12 +1369,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
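+
+	/* No one is waiting yet; by convention, last_waiter points to itself. */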
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1380,32 +1439,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
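+		/*
+		 * Only one scan node can have a query in flight on a given
+		 * connection at a time.  If another node currently owns this
+		 * connection, this node adds itself to the owner's waiter list and
+		 * returns an empty slot; the owner hands the connection to its
+		 * first waiter once its own result has been absorbed.
+		 */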
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting behind this node on the same connection,
+			 * let the first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting on it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Someone else is holding this connection and we want this node
+			 * to run later.  Add it to the tail of the waiters' list and
+			 * return not-ready.  To avoid scanning through the waiters'
+			 * list, the current owner maintains a shortcut to the last
+			 * waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Add the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this point no query is in flight on the connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Add the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1421,7 +1578,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1429,6 +1586,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1457,9 +1617,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1477,7 +1637,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1485,16 +1645,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony state and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1696,7 +1872,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1775,6 +1953,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1785,14 +1965,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1800,10 +1980,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1841,6 +2021,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1861,14 +2043,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1876,10 +2058,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1917,6 +2099,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1937,14 +2121,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1952,10 +2136,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2002,16 +2186,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2291,7 +2475,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2344,7 +2530,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2391,8 +2580,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2511,6 +2700,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2554,6 +2744,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2908,11 +3108,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2978,47 +3178,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the result of a previously-sent FETCH and store the rows.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3028,27 +3277,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3132,7 +3436,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3142,12 +3446,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3155,9 +3459,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3288,9 +3592,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3298,10 +3602,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4440,6 +4744,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
+	slot = ExecForeignScan(node);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+						   bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4797,7 +5175,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..b3ac615 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 56b01d0..4dca0c4 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1511,12 +1511,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1575,8 +1575,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2
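
The waiter-list comments in postgresIterateForeignScan above boil down to the
following protocol. Here is a minimal sketch, with Conn and Scan as simplified
stand-ins for PgFdwConnpriv and PgFdwScanState; the helper functions are
hypothetical, for illustration only, not part of the patch:

/*
 * Sketch only: each connection has at most one owner (a scan with a query
 * in flight); other scans queue behind it.  The invariant is that a scan
 * with no waiters has last_waiter pointing at itself, so both enqueueing
 * and hand-over are O(1).
 */
typedef struct Scan Scan;

typedef struct Conn
{
	Scan	   *current_owner;	/* scan with a query in flight, or NULL */
} Conn;

struct Scan
{
	Conn	   *conn;
	Scan	   *waiter;			/* next scan waiting for this connection */
	Scan	   *last_waiter;	/* tail shortcut, valid on the owner only */
};

/* Append a scan to the owner's wait queue using the tail shortcut. */
static void
enqueue_waiter(Scan *owner, Scan *me)
{
	owner->last_waiter->waiter = me;
	owner->last_waiter = me;
}

/* Hand the connection over to the first waiter, if any. */
static Scan *
hand_over(Scan *owner)
{
	Scan	   *next = owner->waiter;

	if (next == NULL)
		return NULL;
	owner->waiter = NULL;
	next->last_waiter = owner->last_waiter;	/* tail moves to the new owner */
	owner->last_waiter = owner;				/* back to the self-pointing
											 * "no waiters" state */
	return next;
}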

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From b50c350b8392b6c7621cb93c863470a07f5bb563 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/5] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by the penalty of mispredicting
branches related to async execution. Apply unlikely() to them to avoid
that penalty on the existing synchronous route. Asynchronous execution
already involves a lot of additional code, so this doesn't add
significant degradation there.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 12d3742..f44c40a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -249,7 +249,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2
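
For context on the hint itself: unlikely() is PostgreSQL's branch-prediction
macro. On GCC-compatible compilers c.h defines it roughly as below (a
simplified sketch; see c.h for the authoritative definition). It expands to
__builtin_expect(), nudging the compiler to lay the flagged branch out off
the hot path, and degrades to a plain boolean test elsewhere:

#ifdef __GNUC__
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x)	((x) != 0)
#endif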

#25Corey Huinker
corey.huinker@gmail.com
In reply to: Kyotaro HORIGUCHI (#24)

On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

9e43e87

Patch fails on current master, but correctly applies to 9e43e87. Thanks for
including the commit id.

Regression tests pass.

As with my last attempt at reviewing this patch, I'm confused about what
kind of queries can take advantage of this patch. Is it only cases where a
local table has multiple inherited foreign table children? Will it work
with queries where two foreign tables are referenced and combined with a
UNION ALL?

#26Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Corey Huinker (#25)

On 2017/03/11 8:19, Corey Huinker wrote:

On Thu, Feb 23, 2017 at 6:59 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

9e43e87

Patch fails on current master, but correctly applies to 9e43e87. Thanks
for including the commit id.

Regression tests pass.

As with my last attempt at reviewing this patch, I'm confused about what
kind of queries can take advantage of this patch. Is it only cases where a
local table has multiple inherited foreign table children?

IIUC, Horiguchi-san's patch adds asynchronous capability for ForeignScans
driven by postgres_fdw (after building some relevant infrastructure
first). The same might be added to other Scan nodes (and probably other
nodes as well) eventually so that more queries will benefit from
asynchronous execution. It may just be that ForeignScans benefit more
from asynchronous execution than other Scan types.

Will it work
with queries where two foreign tables are referenced and combined with a
UNION ALL?

I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append. But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.

Thanks,
Amit


#27Corey Huinker
corey.huinker@gmail.com
In reply to: Amit Langote (#26)

I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append. But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.

Ok, I'll re-run my test from a few weeks back and see if anything has
changed.

#28Corey Huinker
corey.huinker@gmail.com
In reply to: Corey Huinker (#27)

On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker <corey.huinker@gmail.com>
wrote:

I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append. But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.

Ok, I'll re-run my test from a few weeks back and see if anything has
changed.

I'm not able to discern any difference in plan between a 9.6 instance and
this patch.

The basic outline of my test is:

EXPLAIN ANALYZE
SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'

I've tried this test where tab1 through tab4 all are the same postgres_fdw
foreign table.
I've tried this test where tab1 through tab4 all are different foreign
tables pointing to the same remote table, sharing the same server
definition.
I've tried this test where tab1 through tab4 all are different foreign
tables, each with its own foreign server definition, all of which
happen to point to the same remote cluster.

Are there some postgresql.conf settings I should set to get a decent test?

#29Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Corey Huinker (#28)

On 2017/03/14 6:31, Corey Huinker wrote:

On Mon, Mar 13, 2017 at 1:06 AM, Corey Huinker <corey.huinker@gmail.com>
wrote:

I think it will, because Append itself has been made async-capable by one
of the patches and UNION ALL uses Append. But as mentioned above, only
the postgres_fdw foreign tables will be able to utilize this for now.

Ok, I'll re-run my test from a few weeks back and see if anything has
changed.

I'm not able to discern any difference in plan between a 9.6 instance and
this patch.

The basic outline of my test is:

EXPLAIN ANALYZE
SELECT c1, c2, ..., cN FROM tab1 WHERE date = '1 day ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab2 WHERE date = '2 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab3 WHERE date = '3 days ago'
UNION ALL
SELECT c1, c2, ..., cN FROM tab4 WHERE date = '4 days ago'

I've tried this test where tab1 through tab4 all are the same postgres_fdw
foreign table.
I've tried this test where tab1 through tab4 all are different foreign
tables pointing to the same remote table, sharing the same server
definition.
I've tried this test where tab1 through tab4 all are different foreign
tables, each with its own foreign server definition, all of which
happen to point to the same remote cluster.

Are there some postgresql.conf settings I should set to get a decent test?

I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.

Thanks,
Amit


#30Corey Huinker
corey.huinker@gmail.com
In reply to: Amit Langote (#29)

I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.

Thanks,
Amit

I could see no performance improvement, even with 16 separate queries
combined with UNION ALL. Query performance was always within +/- 10% of a 9.6
instance given the same script. I must be missing something.

#31Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Corey Huinker (#30)

On 2017/03/14 10:08, Corey Huinker wrote:

I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.

I could see no performance improvement, even with 16 separate queries
combined with UNION ALL. Query performance was always with +/- 10% of a 9.6
instance given the same script. I must be missing something.

Hmm, maybe I'm missing something too.

Anyway, here is an older message on this thread from Horiguchi-san where
he shared some of the test cases that this patch improves performance for:

/messages/by-id/20161018.103051.30820907.horiguchi.kyotaro@lab.ntt.co.jp

From that message:

<quote>
I measured performance and had the following result.

t0 - SELECT sum(a) FROM <local single table>;
pl - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time<ms> (std dev <ms>)"

sync
t0: 3820.33 ( 1.88)
pl: 1608.59 ( 12.06)
pf0: 7928.29 ( 46.58)
pf1: 8023.16 ( 26.43)

async
t0: 3806.31 ( 4.49) 0.4% faster (should be error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster
</quote>

IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
same server) measured with different implementations of the patch.

Thanks,
Amit


#32Corey Huinker
corey.huinker@gmail.com
In reply to: Amit Langote (#31)

On Mon, Mar 13, 2017 at 9:28 PM, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>
wrote:

On 2017/03/14 10:08, Corey Huinker wrote:

I don't think the plan itself will change as a result of applying this
patch. You might however be able to observe some performance improvement.

I could see no performance improvement, even with 16 separate queries
combined with UNION ALL. Query performance was always within +/- 10% of a 9.6
instance given the same script. I must be missing something.

Hmm, maybe I'm missing something too.

Anyway, here is an older message on this thread from Horiguchi-san where
he shared some of the test cases that this patch improves performance for:

/messages/by-id/20161018.103051.30820907.horiguchi.kyotaro%40lab.ntt.co.jp

From that message:

<quote>
I measured performance and had the following result.

t0 - SELECT sum(a) FROM <local single table>;
pl - SELECT sum(a) FROM <4 local children>;
pf0 - SELECT sum(a) FROM <4 foreign children on single connection>;
pf1 - SELECT sum(a) FROM <4 foreign children on dedicate connections>;

The result is written as "time<ms> (std dev <ms>)"

sync
t0: 3820.33 ( 1.88)
pl: 1608.59 ( 12.06)
pf0: 7928.29 ( 46.58)
pf1: 8023.16 ( 26.43)

async
t0: 3806.31 ( 4.49) 0.4% faster (should be error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster
</quote>

IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
same server) measured with different implementations of the patch.

Thanks,
Amit

I reworked the test such that all of the foreign tables inherit from the
same parent table, and if you query that you do get async execution. But it
doesn't work when just stringing together those foreign tables with UNION
ALLs.

I don't know how to proceed with this review if that was a goal of the
patch.

#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Corey Huinker (#32)

Corey Huinker <corey.huinker@gmail.com> writes:

I reworked the test such that all of the foreign tables inherit from the
same parent table, and if you query that you do get async execution. But it
doesn't work when just stringing together those foreign tables with UNION
ALLs.

I don't know how to proceed with this review if that was a goal of the
patch.

Whether it was a goal or not, I'd say there is something either broken
or incorrectly implemented if you don't see that. The planner (and
therefore also the executor) generally treats inheritance the same as
simple UNION ALL. If that's not the case here, I'd want to know why.

regards, tom lane


#34Corey Huinker
corey.huinker@gmail.com
In reply to: Tom Lane (#33)

On Thu, Mar 16, 2017 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Corey Huinker <corey.huinker@gmail.com> writes:

I reworked the test such that all of the foreign tables inherit from the
same parent table, and if you query that you do get async execution. But it
doesn't work when just stringing together those foreign tables with UNION
ALLs.

I don't know how to proceed with this review if that was a goal of the
patch.

Whether it was a goal or not, I'd say there is something either broken
or incorrectly implemented if you don't see that. The planner (and
therefore also the executor) generally treats inheritance the same as
simple UNION ALL. If that's not the case here, I'd want to know why.

regards, tom lane

Updated commitfest entry to "Returned With Feedback".

#35Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Corey Huinker (#32)
4 attachment(s)

At Thu, 16 Mar 2017 17:16:32 -0400, Corey Huinker <corey.huinker@gmail.com> wrote in <CADkLM=cBZEX9L9HnhJYrtfiAN5Ebdu=xbvM_poWVGBR7yN3gVw@mail.gmail.com>

On Thu, Mar 16, 2017 at 4:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Corey Huinker <corey.huinker@gmail.com> writes:

I reworked the test such that all of the foreign tables inherit from the
same parent table, and if you query that you do get async execution. But it
doesn't work when just stringing together those foreign tables with UNION
ALLs.

I don't know how to proceed with this review if that was a goal of the
patch.

Whether it was a goal or not, I'd say there is something either broken
or incorrectly implemented if you don't see that. The planner (and
therefore also the executor) generally treats inheritance the same as
simple UNION ALL. If that's not the case here, I'd want to know why.

regards, tom lane

Updated commitfest entry to "Returned With Feedback".

Sorry for the absence. For information, I'll continue to write
some more.

At Tue, 14 Mar 2017 10:28:36 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <e7dc8128-f32b-ff9a-870e-f1117b8e4fa6@lab.ntt.co.jp>

async
t0: 3806.31 ( 4.49) 0.4% faster (should be error)
pl: 1629.17 ( 0.29) 1.3% slower
pf0: 6447.07 ( 25.19) 18.7% faster
pf1: 1876.80 ( 47.13) 76.6% faster
</quote>

IIUC, pf0 and pf1 are the same test case (all 4 foreign tables target the
same server) measured with different implementations of the patch.

pf0 is measured on partitioned (sharded) tables on one foreign
server, that is, sharing a connection. pf1 is, in contrast, sharded
tables that each have a dedicated server (or connection). The parent
server is async-patched and the child server is not patched.

An async-capable plan is generated in the planner. An Append that contains
at least one async-capable child becomes an async-aware Append. So the
async feature should also be effective for the UNION ALL case.

The following will work faster than the unpatched version.

SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft;

I'll measure the performance for the case next week.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0002-Asynchronous-execution-framework.patch (text/x-patch; charset=us-ascii)
From f049f01a92e91f4185f12f814dd90bb16d390121 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/5] Asynchronous execution framework

This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), which handles multiple inputs in a
single backend process. To avoid degrading non-async execution, this
framework uses a completely different channel to convey tuples.
You will find the details of the API at the end of
src/backend/executor/README.
---
 src/backend/executor/Makefile           |   4 +-
 src/backend/executor/README             |  45 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 520 ++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c     |   1 +
 src/backend/executor/instrument.c       |   2 +-
 src/backend/executor/nodeAppend.c       | 169 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 +++
 src/backend/nodes/copyfuncs.c           |   2 +
 src/backend/nodes/outfuncs.c            |   2 +
 src/backend/nodes/readfuncs.c           |   2 +
 src/backend/optimizer/plan/createplan.c |  64 +++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/utils/adt/ruleutils.c       |   6 +-
 src/include/executor/execAsync.h        |  30 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  17 ++
 src/include/nodes/execnodes.h           |  65 +++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 21 files changed, 971 insertions(+), 30 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index d281906..d6c74bd 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+	   execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execReplication.o execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c..7bd009c 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -199,3 +199,48 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest.  Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing 0 to
+poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
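
To make the three-callback contract above concrete, a producer node might
wire the callbacks up roughly as follows. This is an illustration only, not
part of the patch: FooState, foo_result_ready(), foo_next_slot() and
foo_socket() are hypothetical stand-ins for the node-specific details.

/* Step 1: answer a request immediately if we can, else ask to wait. */
static void
ExecAsyncFooRequest(EState *estate, PendingAsyncRequest *areq)
{
	FooState   *foo = (FooState *) areq->requestee;

	if (foo_result_ready(foo))
		ExecAsyncRequestDone(estate, areq, (Node *) foo_next_slot(foo));
	else
		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}

/* Step 2: register the fd event we declared above. */
static bool
ExecAsyncFooConfigureWait(EState *estate, PendingAsyncRequest *areq,
						  bool reinit)
{
	FooState   *foo = (FooState *) areq->requestee;

	if (!reinit)
		return true;			/* event already in the set */
	AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
					  foo_socket(foo), NULL, areq);
	return true;
}

/* Step 3: the fd became ready; deliver the tuple. */
static void
ExecAsyncFooNotify(EState *estate, PendingAsyncRequest *areq)
{
	FooState   *foo = (FooState *) areq->requestee;

	ExecAsyncRequestDone(estate, areq, (Node *) foo_next_slot(foo));
}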
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 5d59f95..ecc8eec 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -473,11 +473,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		nasync = estate->es_num_pending_async;
+
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (nasync >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[nasync] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[nasync] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[nasync];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async;
+
+	/* Initialize the new request. */
+	areq->state = ASYNCREQ_IDLE;
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
+
+	/* No result available now, make this node pending */
+	estate->es_num_pending_async++;
+
+	return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	Assert(estate->es_async_callback_pending == 0);
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Check for events only if any node is async-not-ready. */
+		if (estate->es_num_async_ready < estate->es_num_pending_async)
+		{
+			/* Don't block if any tuple available. */
+			if (estate->es_async_callback_pending > 0)
+				ExecAsyncEventWait(estate, 0);
+			else if (!ExecAsyncEventWait(estate, cur_timeout))
+			{	/* Not fired */
+				/* Exited before timeout. Calculate the remaining time. */
+				instr_time      cur_time;
+
+				/* Wait forever  */
+				if (timeout < 0)
+					continue;
+
+				INSTR_TIME_SET_CURRENT(cur_time);
+				INSTR_TIME_SUBTRACT(cur_time, start_time);
+				cur_timeout = Max(0,
+					timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time));
+
+				if (cur_timeout > 0)
+					continue;
+			}
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
+			/* Notify if the requestee is ready */
+			if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+				ExecAsyncNotify(estate, areq);
+
+			/* Deliver the acquired tuple to the requester */
+			if (areq->state == ASYNCREQ_COMPLETE)
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
+			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all non-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+				if (tail->state == ASYNCREQ_COMPLETE)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; on the contrary,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out, or true if anything was found or there is
+ * no event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+	bool	added = false;
+	bool	fired = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+		/*
+		 * The wait event set created here should live beyond the ExecutorState
+		 * context but be released in case of error.
+		 */
+		estate->es_wait_event_set =
+			CreateWaitEventSet(TopTransactionContext,
+							   TopTransactionResourceOwner,
+							   estate->es_allocated_fd_events + 1);
+
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0 || areq->wants_process_latch)
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/*
+	 * We may have no event to wait for. This occurs when all nodes that
+	 * are executing asynchronously have tuples immediately available.
+	 */
+	if (!added)
+		return true;
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			Assert(areq->state == ASYNCREQ_WAITING);
+
+			areq->state = ASYNCREQ_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
+			fired = true;
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(areq->state == ASYNCREQ_WAITING);
+				areq->state = ASYNCREQ_CALLBACK_PENDING;
+				estate->es_async_callback_pending++;
+				fired = true;
+			}
+		}
+	}
+
+	return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * An individual request may omit registering an event, but it is the
+ * node driver's responsibility to register at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+
+	estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+	estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on process latch.  num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNCREQ_WAITING;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+		ExecAsyncClearEvents(estate);
+}
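
To make the contract concrete, here is a rough sketch of how a requestee
might pair this call with its ConfigureWait callback.  This is purely
illustrative; ExampleScanState and the example_* names are hypothetical
and not part of the patch:

typedef struct ExampleScanState		/* hypothetical node state */
{
	PlanState	ps;
	PGconn	   *conn;			/* libpq connection we wait on */
} ExampleScanState;

static void
example_begin_wait(EState *estate, PendingAsyncRequest *areq)
{
	/* Announce one FD event, no latch interest; reuse the event set. */
	ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
}

static bool
example_configure_wait(EState *estate, PendingAsyncRequest *areq, bool reinit)
{
	ExampleScanState *node = (ExampleScanState *) areq->requestee;

	/* Register the socket we are waiting on, pointing back at areq. */
	AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
					  PQsocket(node->conn), NULL, areq);
	return true;
}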
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	Assert(areq->state == ASYNCREQ_IDLE ||
+		   areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->state = ASYNCREQ_COMPLETE;
+	estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+	if (estate->es_wait_event_set == NULL)
+		return;
+
+	FreeWaitEventSet(estate->es_wait_event_set);
+	estate->es_wait_event_set = NULL;
+}
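
For orientation, the requestor side of this interface is driven roughly as
sketched below.  This is modeled on the ExecAppend changes later in the
patch; example_drive_children is a hypothetical function, and a real caller
must also handle zero timeouts and results that are delivered synchronously
from within ExecAsyncRequest itself:

static void
example_drive_children(EState *estate, PlanState *requestor,
					   PlanState **children, int nchildren)
{
	int			i;

	/* Ask each async-capable child for a tuple. */
	for (i = 0; i < nchildren; i++)
		ExecAsyncRequest(estate, requestor, i, children[i]);

	/*
	 * Pump the event loop; with a timeout of -1 this only returns once a
	 * result has been delivered through the requestor's response callback.
	 */
	while (!ExecAsyncEventLoop(estate, requestor, -1))
		;
}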
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 80c77ad..31222ea 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -117,6 +117,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6986cae..12d3742 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -142,6 +148,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, all async requests need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -176,9 +191,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -193,15 +208,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * XXX: Always clear the registered events.  This seems a bit
+		 * inefficient, but the set of events to wait for changes almost
+		 * randomly from call to call.
+		 */
+		ExecAsyncClearEvents(estate);
+
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			node->as_nasyncpending++;
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+		}
+
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
+		 */
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			while (node->as_nasyncpending > 0)
+			{
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+
+				/* Timeout reached.  Fall through to sync nodes, if any exist. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -221,14 +306,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -267,6 +359,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -285,6 +387,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->state == ASYNCREQ_COMPLETE);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* This is no longer pending */
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately, because it might complete
+	 * synchronously; instead, ExecAppend will issue the next request.
+	 */
+	node->as_needrequest =
+		bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 3b6d139..0a46f5f 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -369,3 +369,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
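
For reference, an FDW opts into all of this by filling in the new FdwRoutine
fields from its handler function.  A minimal sketch, with hypothetical
example_* callbacks (patch 3 below does exactly this for postgres_fdw):

static bool
example_IsForeignPathAsyncCapable(ForeignPath *path)
{
	return true;				/* claim async support for every path */
}

static void example_ForeignAsyncRequest(EState *estate,
							PendingAsyncRequest *areq);
static bool example_ForeignAsyncConfigureWait(EState *estate,
							PendingAsyncRequest *areq, bool reinit);
static void example_ForeignAsyncNotify(EState *estate,
							PendingAsyncRequest *areq);

Datum
example_fdw_handler(PG_FUNCTION_ARGS)
{
	FdwRoutine *routine = makeNode(FdwRoutine);

	/* ... the usual scan and modify callbacks go here ... */
	routine->IsForeignPathAsyncCapable = example_IsForeignPathAsyncCapable;
	routine->ForeignAsyncRequest = example_ForeignAsyncRequest;
	routine->ForeignAsyncConfigureWait = example_ForeignAsyncConfigureWait;
	routine->ForeignAsyncNotify = example_ForeignAsyncNotify;

	PG_RETURN_POINTER(routine);
}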
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 25fd051..7b548c0 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -236,6 +236,8 @@ _copyAppend(const Append *from)
 	 * copy remainder of node
 	 */
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 7418fbe..688d197 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -369,6 +369,8 @@ _outAppend(StringInfo str, const Append *node)
 	_outPlanInfo(str, (const Plan *) node);
 
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d3bbc02..7cb9d2f 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1565,6 +1565,8 @@ _readAppend(void)
 	ReadCommonPlan(&local_node->plan);
 
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 89e1946..14b46ef 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -199,7 +199,8 @@ static CteScan *make_ctescan(List *qptlist, List *qpqual,
 			 Index scanrelid, int ctePlanId, int cteParam);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist);
+static Append *make_append(List *asyncplans, int nasyncplans,
+						   int referent, List *tlist);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -279,7 +280,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
 
 /*
  * create_plan
@@ -980,8 +981,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1007,7 +1012,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * deparsing tlist entries in EXPLAIN (see set_deparse_planstate), so we
+	 * must keep the first child of best_path->subpaths at the head of the
+	 * subplan list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -1016,7 +1028,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1026,7 +1049,8 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5161,7 +5185,7 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist)
+make_append(List *appendplans, int nasyncplans,	int referent, List *tlist)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5171,6 +5195,8 @@ make_append(List *appendplans, List *tlist)
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6492,3 +6518,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7cacb1e..1a47c2a 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3404,6 +3404,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 		/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 5c82325..779ffb5 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4264,7 +4264,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+		dpns->outer_planstate =
+			((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 1b167b8..e4ba4a9 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..863ff0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f856f60..0308afc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -358,6 +358,32 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+	ASYNCREQ_IDLE,						/* Nothing is requested */
+	ASYNCREQ_WAITING,					/* Waiting for events */
+	ASYNCREQ_CALLBACK_PENDING,			/* Having events to be processed */
+	ASYNCREQ_COMPLETE					/* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	AsyncRequestState state;
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -437,6 +463,32 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area   *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_num_async_ready is the number of PendingAsyncRequests that are
+	 * ready to return a tuple.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;		/* # of nodes to wait for */
+	int			es_max_pending_async;		/* max # of pending nodes */
+	int			es_async_callback_pending;	/* # of nodes to callback */
+	int			es_num_async_ready;			/* # of tuple-ready nodes */
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -1182,17 +1234,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index b880dc1..0d4f285 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -228,6 +228,8 @@ typedef struct Append
 {
 	Plan		plan;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 60c78d1..3265a48 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -789,7 +789,8 @@ typedef enum
 	WAIT_EVENT_PARALLEL_FINISH,
 	WAIT_EVENT_PARALLEL_BITMAP_SCAN,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From 7cf7a75a323634c3f89bb38167bd2a83b2fa8d13 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/5] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend is slowed down by mispredictions of the branches related to
asynchronous execution. Apply unlikely() to those branches so that the
purely synchronous path does not pay that penalty. The asynchronous path
already carries a fair amount of extra code, so this does not cause
significant degradation there.
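
(For readers unfamiliar with the hint: unlikely() expands to GCC's
__builtin_expect.  PostgreSQL's c.h defines it roughly as follows.)

#ifdef __GNUC__
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define unlikely(x)	((x) != 0)
#endif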
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 12d3742..f44c40a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -208,7 +208,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -249,7 +249,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

0003-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 7c1fca8aae368466300e6c48f650f3ba1d310577 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/5] Make postgres_fdw async-capable.

Make postgres_fdw async-capable using the new infrastructure. Additionally,
give each postgres_fdw connection a connection-specific storage area so
that foreign scans on the same connection can share data; postgres_fdw
uses it to track the scan node currently running a query on the underlying
connection. This allows asynchronous execution of multiple foreign scans
on one foreign server.
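
The sharing scheme amounts to a small intrusive wait queue per connection.
Conceptually it works as below; the field names follow the patch, but the
helper function itself is hypothetical:

/* Append a scan node to the owning connection's wait queue. */
static void
example_enqueue_waiter(PgFdwScanState *owner_state, ForeignScanState *node)
{
	PgFdwScanState *tail_state = GetPgFdwScanState(owner_state->last_waiter);

	tail_state->waiter = node;			/* link behind the current tail */
	owner_state->last_waiter = node;	/* the owner keeps a tail shortcut */
}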
---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out | 120 +++---
 contrib/postgres_fdw/postgres_fdw.c            | 522 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  12 +-
 5 files changed, 583 insertions(+), 152 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..d8ded74 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocate it with
+ * initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 0b9e3e4..90691e5 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6401,34 +6401,39 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6438,34 +6443,39 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6494,11 +6504,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Hash Cond: (bar2.f1 = foo.f1)
@@ -6511,11 +6521,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6546,16 +6556,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6573,16 +6583,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6733,27 +6743,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 990313a..093fa1a 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -122,10 +128,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -136,7 +159,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -152,6 +175,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* Last node in the list of waiters.
+									 * Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -165,11 +195,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -192,6 +222,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -290,6 +321,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -350,6 +382,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+							PendingAsyncRequest *areq,
+							bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -370,7 +410,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -435,6 +478,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -469,6 +513,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1320,12 +1370,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1381,32 +1440,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting for this node on the same connection, let
+			 * the first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Someone else is holding this connection and we want this node
+			 * to run later. Add myself to the tail of the waiters' list, then
+			 * return not-ready.  To avoid scanning through the waiters' list,
+			 * the current owner maintains the shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this time no node is running on the connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1422,7 +1579,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1430,6 +1587,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1458,9 +1618,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1478,7 +1638,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1486,16 +1646,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Clean up the async state and absorb any remaining result on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1697,7 +1873,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1776,6 +1954,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1786,14 +1966,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1801,10 +1981,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1842,6 +2022,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1862,14 +2044,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1877,10 +2059,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1918,6 +2100,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1938,14 +2122,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1953,10 +2137,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2003,16 +2187,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2292,7 +2476,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2345,7 +2531,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2392,8 +2581,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2512,6 +2701,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2555,6 +2745,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2909,11 +3109,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2979,47 +3179,96 @@ create_cursor(ForeignScanState *node)
- * Fetch some more rows from the node's cursor.
+ * Send an asynchronous request to fetch more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
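+	/* This node now owns the connection until its result is drained */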
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows requested by a preceding request_more_data() call.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
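+		/* Grow the tuple array so kept-back tuples and new rows both fit */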
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3029,27 +3278,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
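+		/* Discard whatever remains of the in-flight query's result */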
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3133,7 +3437,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3143,12 +3447,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3156,9 +3460,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3289,9 +3593,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3299,10 +3603,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4445,6 +4749,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
+	slot = ExecForeignScan(node);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+						   bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4802,7 +5180,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 46cac55..b3ac615 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -103,6 +104,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 56b01d0..4dca0c4 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1511,12 +1511,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1575,8 +1575,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From 5b685ff78f11ee08c385d7a6c793f4d7cfc164e3 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 0fad806..1efdeb4 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index ea7f930..7a8059f 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -61,6 +61,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -89,6 +90,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -323,7 +326,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -481,12 +484,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -546,6 +552,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -581,6 +592,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 6f1ef0b..503aef1 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  &MyProc->procLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2
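
To sketch how a caller would use this (a hypothetical example, not code
from the patch; the final WaitEventSetWait argument, wait_event_info, is
passed as 0 for brevity): a WaitEventSet created with a resource owner
is freed during resource-owner cleanup, so an error raised while waiting
no longer leaks the set.

	#include "storage/latch.h"
	#include "utils/resowner.h"

	static void
	wait_for_socket_readable(pgsocket sock)
	{
		WaitEvent	event;

		/* An owned set is released automatically if we error out below. */
		WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext,
											   CurrentResourceOwner, 1);

		AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL, NULL);
		(void) WaitEventSetWait(set, -1 /* no timeout */, &event, 1, 0);
		FreeWaitEventSet(set);	/* also forgets the resource owner entry */
	}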

#36Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#35)

Hello. This is the final report in this CF period.

At Fri, 17 Mar 2017 17:35:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170317.173505.152063931.horiguchi.kyotaro@lab.ntt.co.jp>

Async-capable plans are generated in the planner. An Append that
contains at least one async-capable child becomes an async-aware
Append. So the async feature should also be effective for the UNION
ALL case.

The following will run faster than the unpatched version.

SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40) as ft;

I'll measure the performance for the case next week.

I found that the following query behaves the same way as the
partitioned-table case.

SELECT sum(a) FROM (SELECT a FROM ft10 UNION ALL SELECT a FROM ft20 UNION ALL SELECT a FROM ft30 UNION ALL SELECT a FROM ft40 UNION ALL *SELECT a FROM ONLY pf0*) as ft;

So the difference comes from the additional non-async-capable child
(the query is faster if there is one). In both cases, the Append
node runs its children asynchronously, but behaves slightly
differently when all async-capable children are busy.

I'll continue working on this from this point aiming to the next
commit fest.

Thank you for the valuable feedback.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#37Corey Huinker
corey.huinker@gmail.com
In reply to: Kyotaro HORIGUCHI (#36)

I'll continue working on this from this point aiming to the next
commit fest.

This probably will not surprise you given the many commits in the past 2
weeks, but the patches no longer apply to master:

$ git apply
~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27:
trailing whitespace.
FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:39:
trailing whitespace.
#include "utils/resowner_private.h"
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:47:
trailing whitespace.
ResourceOwner resowner; /* Resource owner */
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:48:
trailing whitespace.

/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:57:
trailing whitespace.
WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL,
3);
error: patch failed: src/backend/libpq/pqcomm.c:201
error: src/backend/libpq/pqcomm.c: patch does not apply
error: patch failed: src/backend/storage/ipc/latch.c:61
error: src/backend/storage/ipc/latch.c: patch does not apply
error: patch failed: src/backend/storage/lmgr/condition_variable.c:66
error: src/backend/storage/lmgr/condition_variable.c: patch does not apply
error: patch failed: src/backend/utils/resowner/resowner.c:124
error: src/backend/utils/resowner/resowner.c: patch does not apply
error: patch failed: src/include/storage/latch.h:101
error: src/include/storage/latch.h: patch does not apply
error: patch failed: src/include/utils/resowner_private.h:18
error: src/include/utils/resowner_private.h: patch does not apply

#38Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Corey Huinker (#37)
5 attachment(s)

Hello,

At Sun, 2 Apr 2017 12:21:14 -0400, Corey Huinker <corey.huinker@gmail.com> wrote in <CADkLM=dN_vt8kazOoiVOfjN6xFHpzf5uiGJz+iN+f4fLbYwSKA@mail.gmail.com>

I'll continue working on this from this point aiming to the next
commit fest.

This probably will not surprise you given the many commits in the past 2
weeks, but the patches no longer apply to master:

Yeah, I'm not surprised by that, but thank you for letting me
know. It greatly reduces the difficulty of merging. Thank you.

$ git apply
~/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch
/home/ubuntu/async/0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch:27:
trailing whitespace.

Maybe the patch was retrieved on Windows and then transferred to a
Linux box. Converting the files' EOLs, or some git configuration,
might fix that. (git am has --no-keep-cr, but I haven't found an
equivalent for git apply.)

The attached patch is rebased on the current master, with no
substantial changes other than disallowing partitioned tables for
async execution by assertion.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From e4c38a11171e8c6c6a1950f122b97b5048c7c5f8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 0fad806..1efdeb4 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4798370..a3372bd 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -61,6 +61,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -89,6 +90,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -323,7 +326,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -481,12 +484,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -546,6 +552,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -581,6 +592,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 6f1ef0b..503aef1 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  &MyProc->procLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2

0002-Asynchronous-execution-framework.patchtext/x-patch; charset=us-asciiDownload
From 505fb96f7ca0a3cc729311e68dbd010fdb098c27 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/5] Asynchronous execution framework

This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), which handles multiple inputs in a
single backend process. To avoid degrading non-async execution, this
framework uses a completely different channel to convey tuples. You
will find the details of the API at the end of
src/backend/executor/README.
---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/README             |  45 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 520 ++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c     |   1 +
 src/backend/executor/instrument.c       |   2 +-
 src/backend/executor/nodeAppend.c       | 169 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 +++
 src/backend/nodes/copyfuncs.c           |   2 +
 src/backend/nodes/outfuncs.c            |   2 +
 src/backend/nodes/readfuncs.c           |   2 +
 src/backend/optimizer/plan/createplan.c |  69 ++++-
 src/backend/postmaster/pgstat.c         |   2 +
 src/backend/utils/adt/ruleutils.c       |   6 +-
 src/include/executor/execAsync.h        |  30 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  17 ++
 src/include/nodes/execnodes.h           |  65 +++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 21 files changed, 974 insertions(+), 29 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest.  Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing -1 to
+poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
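
To make the shape of this API concrete, here is a condensed sketch of
both sides.  It is modeled on the README text above and on the
postgres_fdw callbacks in the WIP patch upthread; the ExecAsyncEventLoop
signature, the my_* names, and the result_is_ready/tuple_delivered tests
are illustrative assumptions, not code from this patch.

	/* Requestor side: ask a child for a tuple, then run the event loop. */
	ExecAsyncRequest(estate, &node->ps, 0 /* request_index */, childstate);
	while (!tuple_delivered)	/* set by our ExecAsyncResponse callback */
		ExecAsyncEventLoop(estate, &node->ps, timeout); /* -1 only polls */

	/* Requestee side: the three callbacks an async producer provides. */
	static void
	myAsyncRequest(EState *estate, PendingAsyncRequest *areq)
	{
		if (result_is_ready)	/* node-specific check */
			ExecAsyncRequestDone(estate, areq, (Node *) slot);
		else					/* wait for one FD event, no latch events */
			ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
	}

	static bool
	myAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
						 bool reinit)
	{
		if (!reinit)			/* our event is already in the set */
			return true;
		AddWaitEventToSet(estate->es_wait_event_set, WL_SOCKET_READABLE,
						  my_sock, NULL, areq);
		return true;
	}

	static void
	myAsyncNotify(EState *estate, PendingAsyncRequest *areq)
	{
		/* The socket is readable: collect the result and hand it back. */
		ExecAsyncRequestDone(estate, areq, (Node *) my_fetch_tuple(areq));
	}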
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7e85c66..ddb6d64 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -478,11 +478,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-capable node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		nasync = estate->es_num_pending_async;
+
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (nasync >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[nasync] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[nasync] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[nasync];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async;
+
+	/* Initialize the new request. */
+	areq->state = ASYNCREQ_IDLE;
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
+
+	/* Track the request as pending; a result may already be available. */
+	estate->es_num_pending_async++;
+
+	return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	Assert(estate->es_async_callback_pending == 0);
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Check for events only if any node is async-not-ready. */
+		if (estate->es_num_async_ready < estate->es_num_pending_async)
+		{
+			/* Don't block if any tuple available. */
+			if (estate->es_async_callback_pending > 0)
+				ExecAsyncEventWait(estate, 0);
+			else if (!ExecAsyncEventWait(estate, cur_timeout))
+			{
+				/* No relevant event fired before the timeout expired. */
+				instr_time      cur_time;
+
+				/* Waiting forever?  Just retry. */
+				if (timeout < 0)
+					continue;
+
+				/* Recompute the time remaining; retry if any is left. */
+				INSTR_TIME_SET_CURRENT(cur_time);
+				INSTR_TIME_SUBTRACT(cur_time, start_time);
+				cur_timeout =
+					timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+				if (cur_timeout > 0)
+					continue;
+				cur_timeout = 0;	/* expired; exit below after delivery */
+			}
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
+			/* Notify if the requestee is ready */
+			if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+				ExecAsyncNotify(estate, areq);
+
+			/* Deliver the acquired tuple to the requester */
+			if (areq->state == ASYNCREQ_COMPLETE)
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
+			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all not-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+				if (tail->state == ASYNCREQ_COMPLETE)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount, whereas we want to
+		 * push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out, or true if any event fired or there was
+ * no event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+	bool	added = false;
+	bool	fired = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+		/*
+		 * The wait event set must outlive the executor's memory context yet
+		 * still be released on error, hence the transaction-level owner.
+		 */
+		estate->es_wait_event_set =
+			CreateWaitEventSet(TopTransactionContext,
+							   TopTransactionResourceOwner,
+							   estate->es_allocated_fd_events + 1);
+
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0 || areq->wants_process_latch)
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/*
+	 * We may have no events to wait for.  This occurs when all nodes
+	 * executing asynchronously have tuples immediately available.
+	 */
+	if (!added)
+		return true;
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			Assert(areq->state == ASYNCREQ_WAITING);
+
+			areq->state = ASYNCREQ_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
+			fired = true;
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(areq->state == ASYNCREQ_WAITING);
+				areq->state = ASYNCREQ_CALLBACK_PENDING;
+				estate->es_async_callback_pending++;
+				fired = true;
+			}
+		}
+	}
+
+	return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the
+ * responsibility of the node driver to register at least one event
+ * per requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+
+	estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+	estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on process latch.  num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNCREQ_WAITING;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+		ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	Assert(areq->state == ASYNCREQ_IDLE ||
+		   areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->state = ASYNCREQ_COMPLETE;
+	estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+	if (estate->es_wait_event_set == NULL)
+		return;
+
+	FreeWaitEventSet(estate->es_wait_event_set);
+	estate->es_wait_event_set = NULL;
+}
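
In short, a PendingAsyncRequest moves through the following states; all of
these transitions are implemented by the functions above, nothing here is
new:

/*
 * PendingAsyncRequest lifecycle:
 *
 *   ASYNCREQ_IDLE ----ExecAsyncSetRequiredEvents----> ASYNCREQ_WAITING
 *   ASYNCREQ_WAITING --event in ExecAsyncEventWait--> ASYNCREQ_CALLBACK_PENDING
 *   ASYNCREQ_CALLBACK_PENDING --ExecAsyncNotify-----> re-armed or completed
 *   ASYNCREQ_IDLE / ASYNCREQ_CALLBACK_PENDING
 *                 ----ExecAsyncRequestDone----------> ASYNCREQ_COMPLETE
 *   ASYNCREQ_COMPLETE --ExecAsyncResponse----------> tuple handed to requestor
 */
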
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 486ddf1..2f896ef 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
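
The guard added to instrument.c above is needed because async execution now
calls InstrStopNode() with nTuples = 0 for cycles that merely issue a
request or service wait events; without it, such a cycle would latch
"running" and record a bogus firsttuple time.  The pattern, slightly
abridged from ExecAsyncRequest in execAsync.c:

	if (requestee->instrument)
		InstrStartNode(requestee->instrument);
	/* ... issue or service the async request; no tuple produced ... */
	if (requestee->instrument)
		InstrStopNode(requestee->instrument, 0);	/* must not set firsttuple */
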
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a107545..d91e621 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up nodes that are scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, every async subplan needs a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+
+		/*
+		 * XXX: Always clear the registered events.  This seems a bit
+		 * inefficient, but the set of events to wait for changes almost
+		 * arbitrarily between calls.
+		 */
+		ExecAsyncClearEvents(estate);
+
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			node->as_nasyncpending++;
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+		}
+
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
+		 */
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			while (node->as_nasyncpending > 0)
+			{
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+
+				/* Timeout reached; fall through to the sync nodes, if any. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->state == ASYNCREQ_COMPLETE);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* This is no longer pending */
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't relaunch it here; the new request could complete immediately.
+	 */
+	node->as_needrequest =
+		bms_add_member(node->as_needrequest, areq->request_index);
+}
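
As a summary of the bookkeeping that ExecAppend and the callback above
maintain (the fields themselves are defined in the execnodes.h hunk below),
every async child that has not yet returned its final NULL is in exactly
one of three places at any instant:

/*
 *   as_needrequest     - ready for a fresh ExecAsyncRequest
 *   in flight          - counted by as_nasyncpending
 *   as_asyncresult[]   - delivered a tuple not yet returned by ExecAppend
 *
 * ExecAppend drains as_asyncresult first, then re-issues requests for
 * everything in as_needrequest, and only then runs the event loop; a
 * child whose final result is NULL simply drops out of all three.
 */
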
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9ae1561..7db5c30 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 61bc502..9856dfb 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -239,6 +239,8 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(partitioned_rels);
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 83fb39f..f324b0c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -372,6 +372,8 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 766f2d8..8c57d81 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1575,6 +1575,8 @@ _readAppend(void)
 
 	READ_NODE_FIELD(partitioned_rels);
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index b121f40..c6825d2 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 			 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *appendplans, int nasyncplans,
+						   int referent, List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -283,7 +284,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
 
 /*
  * create_plan
@@ -992,8 +993,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1019,7 +1024,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used
+	 * when deparsing tlist entries in EXPLAIN (see set_deparse_planstate).
+	 * Since async-capable children are moved to the front of the subplan
+	 * list below, track where the first child of best_path->subpaths lands.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -1028,7 +1040,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1038,7 +1061,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist, best_path->partitioned_rels);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5245,17 +5270,23 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+			List *tlist, List *partitioned_rels)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
 
+	/* Asynchronous execution is currently unsupported on partitioned tables */
+	Assert(nasyncplans == 0 || partitioned_rels == NIL);
+
 	plan->targetlist = tlist;
 	plan->qual = NIL;
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->partitioned_rels = partitioned_rels;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6578,3 +6609,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 56a8bf2..fbcdba6 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3571,6 +3571,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 			break;
 		case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
 			event_name = "LogicalSyncStateChange";
+			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
 			break;
 		/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 0c1a201..e224158 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4335,7 +4335,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int			idx = ((Append *) ps->plan)->referent;
+
+		dpns->outer_planstate = ((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 1b167b8..e4ba4a9 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..863ff0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index fa99244..735a157 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -395,6 +395,32 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+	ASYNCREQ_IDLE,						/* Nothing is requested */
+	ASYNCREQ_WAITING,					/* Waiting for events */
+	ASYNCREQ_CALLBACK_PENDING,			/* Has events to be processed */
+	ASYNCREQ_COMPLETE					/* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	AsyncRequestState state;
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -476,6 +502,32 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area   *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_num_async_ready is the number of PendingAsyncRequests that are
+	 * ready to return a tuple.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;		/* # of pending async requests */
+	int			es_max_pending_async;		/* allocated # of request slots */
+	int			es_async_callback_pending;	/* # of requests to call back */
+	int			es_num_async_ready;			/* # of tuple-ready requests */
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -939,17 +991,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a2dd26f..15f4de9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -235,6 +235,8 @@ typedef struct Append
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
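
Concretely, these two fields encode the reordering done in
create_append_plan above.  A hypothetical Append over five children, two of
them async-capable, ends up as:

/*
 *   appendplans = [async0, async1, sync0, sync1, sync2]
 *   nasyncplans = 2
 *   referent    = 0 if the plan's original first child was async-capable,
 *                 2 (i.e. nasyncplans) otherwise
 *
 * Async subplans always go to the front of the list; referent records
 * where the original first child landed so that EXPLAIN deparsing
 * (set_deparse_planstate) can keep using it as the representative.
 */
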
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e29397f..8bcfcb2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -811,7 +811,8 @@ typedef enum
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
-	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

0003-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 741b974f971f7f94fdc9cc7bf76db9c73767b7d6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/5] Make postgres_fdw async-capable.

Make postgres_fdw async-capable using the new infrastructure.  Additionally,
this gives each postgres_fdw connection a connection-specific storage
area so that foreign scans on the same connection can share some data.
postgres_fdw uses it to track the scan node currently running a query
on the underlying connection.  This allows asynchronous execution of
multiple foreign scans on one foreign server.
---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out | 120 +++---
 contrib/postgres_fdw/postgres_fdw.c            | 522 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  12 +-
 5 files changed, 583 insertions(+), 152 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..d8ded74 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Return the connection-specific storage for this user.  Allocate it,
+ * zeroed and of the given initsize, if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
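
Here is a hypothetical use of the new storage by a scan node, to discover
whether a different scan currently owns the connection.  PgFdwConnpriv and
GetConnectionSpecificStorage are from this patch; "user" and
"thisScanState" stand for the caller's UserMapping and ForeignScanState:

	PgFdwConnpriv *connpriv =
		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));

	if (connpriv->current_owner != NULL &&
		connpriv->current_owner != thisScanState)
	{
		/*
		 * Another scan's query is still in flight on this connection; its
		 * result must be fetched or discarded before we can send ours.
		 */
	}
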
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 1a9e6c8..88f0c7e 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6534,34 +6534,39 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6571,34 +6576,39 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(22 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(27 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6627,11 +6637,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Hash Cond: (bar2.f1 = foo.f1)
@@ -6644,11 +6654,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (37 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6679,16 +6689,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6706,16 +6716,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6866,27 +6876,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2851869..2347fba 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -122,10 +128,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -136,7 +159,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -152,6 +175,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+								 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -165,11 +195,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -192,6 +222,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -290,6 +321,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -350,6 +382,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+							PendingAsyncRequest *areq,
+							bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -370,7 +410,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -435,6 +478,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -469,6 +513,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1327,12 +1377,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1388,32 +1447,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If other nodes are waiting for this node to release the
+			 * connection, let the first waiter be its next owner.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Someone else is holding this connection and we want this node
+			 * to run later.  Add this node to the tail of the waiters' list,
+			 * then return not-ready.  To avoid scanning through the waiters'
+			 * list, the current owner maintains a shortcut to the last
+			 * waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this time no node is running on the connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1429,7 +1586,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1437,6 +1594,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1465,9 +1625,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1485,7 +1645,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1493,16 +1653,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1704,7 +1880,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1783,6 +1961,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1793,14 +1973,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1808,10 +1988,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1849,6 +2029,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1869,14 +2051,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1884,10 +2066,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1925,6 +2107,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1945,14 +2129,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1960,10 +2144,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2010,16 +2194,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2299,7 +2483,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2352,7 +2538,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2399,8 +2588,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2519,6 +2708,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2561,6 +2751,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2915,11 +3115,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2985,47 +3185,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Read the rows returned for the FETCH previously sent by this node.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3035,27 +3284,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3139,7 +3443,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3149,12 +3453,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3162,9 +3466,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3295,9 +3599,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3305,10 +3609,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4502,6 +4806,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
+	slot = ExecForeignScan(node);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+						   bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4859,7 +5237,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 57dbb79..1194d29 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,6 +79,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -117,6 +118,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index cf70ca2..d161a8e 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1534,12 +1534,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1598,8 +1598,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From f7a0e01e079af33059aa366af18105727b9a0ce0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/5] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by branch-misprediction penalties
on the branches related to async execution. Apply unlikely() to them
to avoid that penalty on the existing synchronous route. Asynchronous
execution already involves a lot of additional code, so this doesn't
add significant degradation there.
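
For reference, a minimal sketch of what unlikely() expands to (an
assumption about the tree's c.h, not something this patch changes):

    /* Branch-prediction hints; on GCC-compatible compilers they map to
     * __builtin_expect(), elsewhere they are no-ops. */
    #ifdef __GNUC__
    #define likely(x)   __builtin_expect((x) != 0, 1)
    #define unlikely(x) __builtin_expect((x) != 0, 0)
    #else
    #define likely(x)   ((x) != 0)
    #define unlikely(x) ((x) != 0)
    #endif

With unlikely(), the compiler can lay out the synchronous path as the
straight-line fall-through code, which is what the hunks below rely on.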
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index d91e621..2bdcee6 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

0005-Fix-a-typo-of-mcxt.c.patch (text/x-patch; charset=us-ascii)
From 26d427bd57c4b5019097a3c1586c14fd7786c7a9 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:14:15 +0900
Subject: [PATCH 5/5] Fix a typo of mcxt.c

---
 src/backend/utils/mmgr/mcxt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c
index 6668bf1..d1598c5 100644
--- a/src/backend/utils/mmgr/mcxt.c
+++ b/src/backend/utils/mmgr/mcxt.c
@@ -208,7 +208,7 @@ MemoryContextDelete(MemoryContext context)
 	MemoryContextDeleteChildren(context);
 
 	/*
-	 * It's not entirely clear whether 'tis better to do this before or after
+	 * It's not entirely clear whether it's better to do this before or after
 	 * delinking the context; but an error in a callback will likely result in
 	 * leaking the whole context (if it's not a root context) if we do it
 	 * after, so let's do it before.
-- 
2.9.2

#39Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#38)
4 attachment(s)

Hello.

At Tue, 04 Apr 2017 19:25:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170404.192539.29699823.horiguchi.kyotaro@lab.ntt.co.jp>

The attached patch is rebased on the current master, with no
substantial changes other than disallowing partitioned tables for
async execution by assertion.

This is just a rebase onto the current master (d761fe2).
I'll recheck the details after this.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patch (text/x-patch; charset=us-ascii)
From 000f0465a59cdabd02f43e886c76c89c14d987a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds resource-owner tracking of WaitEventSets and
allows the creator of a WaitEventSet to specify a resource owner.
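
A minimal usage sketch under this change (the caller and the socket
variable here are hypothetical; passing a NULL resource owner keeps the
old behavior):

    /* Tie the set to CurrentResourceOwner so that an error raised before
     * FreeWaitEventSet() cannot leak it. */
    WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext,
                                           CurrentResourceOwner, 2);

    AddWaitEventToSet(set, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
    AddWaitEventToSet(set, WL_SOCKET_READABLE, sock, NULL, NULL);

    /* ... WaitEventSetWait(...) as usual ... */

    FreeWaitEventSet(set);      /* also unregisters it from the owner */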
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index d1cc38b..1c34114 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 53e6bf2..8c182a2 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 5afb211..1d9111e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  &MyProc->procLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property usable as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property usable as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2

0002-Asynchronous-execution-framework.patch (text/x-patch; charset=us-ascii)
From 1fd1847c105ddd1ed2d10cd9043081d642e6a57f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:46:48 +0900
Subject: [PATCH 2/4] Asynchronous execution framework

This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), which multiplexes several inputs
in a single backend process. To avoid degrading non-async execution,
the framework uses a completely separate channel to convey tuples.
The details of the API are described at the end of
src/backend/executor/README.
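
As a requestor-side sketch, an Append-like parent drives the framework
roughly as follows (paraphrasing the README text added below; the exact
argument lists of ExecAsyncRequest and ExecAsyncEventLoop are assumptions
here, not the authoritative API):

    /* Ask every async-capable child that needs a new request for a tuple;
     * results come back through the ExecAsyncResponse callback. */
    while ((i = bms_first_member(node->as_needrequest)) >= 0)
        ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);

    /* Run the event loop; as in ExecAppend below, a timeout of 0 polls
     * and -1 blocks until some child becomes ready. */
    if (node->as_nasyncpending > 0)
        ExecAsyncEventLoop(estate, &node->ps, node->as_syncdone ? -1 : 0);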
---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/README             |  45 +++++++++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execProcnode.c     |   1 +
 src/backend/executor/instrument.c       |   2 +-
 src/backend/executor/nodeAppend.c       | 169 +++++++++++++++++++++++++++++---
 src/backend/executor/nodeForeignscan.c  |  49 +++++++++
 src/backend/nodes/copyfuncs.c           |   2 +
 src/backend/nodes/outfuncs.c            |   2 +
 src/backend/nodes/readfuncs.c           |   2 +
 src/backend/optimizer/plan/createplan.c |  69 +++++++++++--
 src/backend/postmaster/pgstat.c         |   2 +
 src/backend/utils/adt/ruleutils.c       |   6 +-
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 ++
 src/include/foreign/fdwapi.h            |  17 ++++
 src/include/nodes/execnodes.h           |  65 +++++++++++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 19 files changed, 424 insertions(+), 29 deletions(-)

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest.  Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing -1 to
+poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
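
[ A concrete instance of these three producer callbacks appears in the
postgres_fdw changes above. A minimal skeleton, with the callback
signature taken from that code and the readiness test simplified (so
treat it as a sketch, not the authoritative shape), could look like:

    static void
    myForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
    {
        ForeignScanState *node = (ForeignScanState *) areq->requestee;
        TupleTableSlot *slot = ExecForeignScan(node);   /* try to produce */

        if (!TupIsNull(slot))
            ExecAsyncRequestDone(estate, areq, (Node *) slot);  /* ready */
        else
            ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
    }

ExecAsyncConfigureWait then registers the file descriptor with
AddWaitEventToSet(), and ExecAsyncNotify reads the now-ready input and
completes the request with ExecAsyncRequestDone(). ]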
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7337d21..4c1991c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -479,11 +479,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5469cde..2b727c0 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index aae5e3f..2c07095 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, every async subplan needs a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * XXX: Always clear the registered events.  This seems a bit
+		 * inefficient, but the set of events to wait for changes almost
+		 * at random from one call to the next, so there is little to be
+		 * gained by trying to preserve them.
+		 */
+		ExecAsyncClearEvents(estate);
+
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			node->as_nasyncpending++;
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+		}
+
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
+		 */
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			while (node->as_nasyncpending > 0)
+			{
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+
+				/* Timed out; fall through to the sync nodes, if any */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->state == ASYNCREQ_COMPLETE);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* This is no longer pending */
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here, since it might complete immediately.
+	 */
+	node->as_needrequest =
+		bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9ae1561..7db5c30 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7811ad5..8cd0821 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,6 +242,8 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(partitioned_rels);
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4949d58..2d50b8a 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -376,6 +376,8 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index e24f5d6..fae9396 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1579,6 +1579,8 @@ _readAppend(void)
 
 	READ_NODE_FIELD(partitioned_rels);
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 94beeb8..9c29787 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *appendplans, int nasyncplans,
+						   int referent, List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -282,7 +283,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
 
 /*
  * create_plan
@@ -1003,8 +1004,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1030,7 +1035,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * explaining tlist entries (see set_deparse_planstate), so we must keep
+	 * the first child in best_path->subpaths at the head of the subplan
+	 * list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -1039,7 +1051,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1049,7 +1072,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist, best_path->partitioned_rels);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5268,17 +5293,23 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+			List *tlist, List *partitioned_rels)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
 
+	/* Async execution is currently not supported on partitioned tables */
+	Assert(nasyncplans == 0 || partitioned_rels == NIL);
+
 	plan->targetlist = tlist;
 	plan->qual = NIL;
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->partitioned_rels = partitioned_rels;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6608,3 +6639,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f453dad..97337bd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3611,6 +3611,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 			break;
 		case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
 			event_name = "LogicalSyncStateChange";
+			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
 			break;
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 9234bc2..0ed6d2c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4425,7 +4425,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int			idx = ((Append *) ((AppendState *) ps)->ps.plan)->referent;
+		dpns->outer_planstate =
+			((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 1b167b8..e4ba4a9 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..863ff0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d33392f..b58c66e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -395,6 +395,32 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+	ASYNCREQ_IDLE,						/* Nothing is requested */
+	ASYNCREQ_WAITING,					/* Waiting for events */
+	ASYNCREQ_CALLBACK_PENDING,			/* Has events to be processed */
+	ASYNCREQ_COMPLETE					/* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	AsyncRequestState state;
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -486,6 +512,32 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently
+	 * valid.  (Entries after that may point to storage that can be reused.)
+	 * es_num_async_ready is the number of PendingAsyncRequests that are
+	 * ready to return a tuple.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;		/* # of pending async requests */
+	int			es_max_pending_async;		/* allocated # of request slots */
+	int			es_async_callback_pending;	/* # of callbacks to process */
+	int			es_num_async_ready;			/* # of tuple-ready requests */
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -950,17 +1002,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d84372d..8bace1f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent;		/* index of the child used as referent when
								 * deparsing (EXPLAIN) */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e029c0..7537ce2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -812,7 +812,8 @@ typedef enum
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
-	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

Attachment: 0003-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From ad2cb622293b3888e0cc7c590f517b5e1b4e5d74 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:49:41 +0900
Subject: [PATCH 3/4] Make postgres_fdw async-capable.

Make postgres_fdw async-capable using the new infrastructure.
Additionally, give postgres_fdw connections a connection-specific
storage area so that foreign scans on the same connection can share
some data; postgres_fdw uses it to track the scan node currently
running on the underlying connection.  This allows asynchronous
execution of multiple foreign scans on one foreign server.
---
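Note (not part of the commit message): the per-connection scheduling
added below is essentially a singly-linked FIFO of waiting scan nodes
in which only the current owner of the connection maintains a shortcut
to the tail.  A standalone sketch of just that list discipline,
compilable on its own (all names other than waiter and last_waiter are
invented for illustration):

    #include <stddef.h>

    typedef struct Scan
    {
        struct Scan *waiter;        /* next node waiting for the connection */
        struct Scan *last_waiter;   /* tail shortcut; valid on the owner */
    } Scan;

    /* Enqueue a node behind the current owner of the connection. */
    static void
    enqueue_waiter(Scan *owner, Scan *node)
    {
        owner->last_waiter->waiter = node;
        owner->last_waiter = node;
    }

    /*
     * The owner finished its fetch: the first waiter (if any) becomes
     * the next owner and inherits the tail shortcut; an idle node's
     * last_waiter points at itself.
     */
    static Scan *
    hand_off(Scan *owner)
    {
        Scan *next = owner->waiter;

        if (next != NULL)
        {
            owner->waiter = NULL;
            next->last_waiter = owner->last_waiter;
            owner->last_waiter = owner;
        }
        return next;
    }
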
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 +++---
 contrib/postgres_fdw/postgres_fdw.c            | 522 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 591 insertions(+), 160 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..d8ded74 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user, allocating it
+ * as a zeroed block of initsize bytes if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4d86ab5..c1c0320 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6414,7 +6414,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6442,7 +6442,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6470,7 +6470,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6498,7 +6498,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6564,35 +6564,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6602,35 +6607,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6660,11 +6670,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6678,11 +6688,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6713,16 +6723,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6740,16 +6750,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6900,27 +6910,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..6c8da30 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;	/* true if we have a result (or EOF) to report */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* Last node in the waiting list.
+									 * Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -348,6 +380,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+							PendingAsyncRequest *areq,
+							bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -368,7 +408,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +481,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +516,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1322,12 +1372,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1383,32 +1442,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If other nodes are waiting on this connection, let the first
+			 * waiter become its next owner.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting on it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Some other node is holding this connection and this node must
+			 * run later.  Add this node to the tail of the waiters' list and
+			 * return not-ready.  To avoid scanning through the waiters'
+			 * list, the current owner maintains a shortcut to the last
+			 * waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this time no node is running on the connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * If the next owner of this connection still has data to fetch,
+		 * send its next request.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+
+		/*
+		 * If we still haven't received a result for this node, return with
+		 * no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Clean up async state and absorb any remaining result on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1699,7 +1875,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1778,6 +1956,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1788,14 +1968,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1803,10 +1983,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1844,6 +2024,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1864,14 +2046,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1879,10 +2061,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1920,6 +2102,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1940,14 +2124,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1955,10 +2139,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2005,16 +2189,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2302,7 +2486,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2355,7 +2541,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2402,8 +2591,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2522,6 +2711,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2564,6 +2754,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2918,11 +3118,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2988,47 +3188,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows from a FETCH previously sent on the node's connection.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3038,27 +3287,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * let the current connection owner read the result for the running query
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3142,7 +3446,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3152,12 +3456,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3165,9 +3469,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3298,9 +3602,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3308,10 +3612,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4582,6 +4886,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * Accept an async request.  Notify the caller if the next tuple is
+ * immediately available.  ExecForeignScan does additional work to finish
+ * the returned tuple, so call it instead of postgresIterateForeignScan to
+ * acquire a tuple in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
+	slot = ExecForeignScan(node);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure the wait event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+						   bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4946,7 +5324,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 25c950d..6dd136c 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 509bb54..3370778 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1488,25 +1488,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1542,12 +1542,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1606,8 +1606,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2

0004-Apply-unlikely-to-suggest-synchronous-route-of.patchtext/x-patch; charset=us-asciiDownload
From e70aca71198c32cd810c0bd728a24aef221b8230 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:50:26 +0900
Subject: [PATCH 4/4] Apply unlikely to suggest synchronous route of 
 ExecAppend.

ExecAppend seems to be slowed down by the penalty of mispredicting
branches related to asynchronous execution. Apply unlikely() to them to
avoid that penalty on the synchronous route. Asynchronous execution
already adds a lot of code, so this doesn't cause significant
degradation.
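
For reference, unlikely() here is the usual compiler branch-prediction
hint; PostgreSQL's real definition lives in src/include/c.h. A minimal
sketch of how such a macro is typically defined:

    /* Hint that cond is expected to be false; degrade to a no-op elsewhere. */
    #if defined(__GNUC__)
    #define unlikely(cond)  __builtin_expect((cond) != 0, 0)
    #else
    #define unlikely(cond)  (cond)
    #endif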
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 2c07095..43e777f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

#40Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#39)
4 attachment(s)

At Mon, 22 May 2017 13:12:14 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170522.131214.20936668.horiguchi.kyotaro@lab.ntt.co.jp>

The attached patch is rebased on the current master, with no
substantial changes other than disallowing partitioned tables for
async execution by assertion.

This is just rebased onto the current master (d761fe2).
I'll recheck further details after this.

Sorry, the patch was missing some files that should have been added.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0002-Asynchronous-execution-framework.patchtext/x-patch; charset=us-asciiDownload
From b849bbbec1c3b9ba62a30c25ac34557a9e279770 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:46:48 +0900
Subject: [PATCH 2/4] Asynchronous execution framework

This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), which multiplexes multiple inputs
on a single backend process. To avoid degrading non-async execution,
this framework uses a completely different channel to convey tuples.
You will find the details of the API at the end of
src/backend/executor/README.
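
To make the select(2) analogy concrete, here is a standalone sketch of
that pattern in plain POSIX C (illustrative only, nothing
PostgreSQL-specific; within the framework, ExecAsyncEventWait plays the
role select() plays below):

    #include <sys/select.h>

    /* Wait until one of nfds inputs is readable and return its index. */
    static int
    wait_for_any(const int *fds, int nfds)
    {
        fd_set      readfds;
        int         maxfd = -1;
        int         i;

        FD_ZERO(&readfds);
        for (i = 0; i < nfds; i++)
        {
            FD_SET(fds[i], &readfds);
            if (fds[i] > maxfd)
                maxfd = fds[i];
        }

        /* A single process blocks on many inputs at once. */
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            return -1;

        for (i = 0; i < nfds; i++)
        {
            if (FD_ISSET(fds[i], &readfds))
                return i;       /* service whichever input became ready */
        }
        return -1;
    }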
---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/README             |  45 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 520 ++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c     |   1 +
 src/backend/executor/instrument.c       |   2 +-
 src/backend/executor/nodeAppend.c       | 169 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 +++
 src/backend/nodes/copyfuncs.c           |   2 +
 src/backend/nodes/outfuncs.c            |   2 +
 src/backend/nodes/readfuncs.c           |   2 +
 src/backend/optimizer/plan/createplan.c |  69 ++++-
 src/backend/postmaster/pgstat.c         |   2 +
 src/backend/utils/adt/ruleutils.c       |   6 +-
 src/include/executor/execAsync.h        |  30 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  17 ++
 src/include/nodes/execnodes.h           |  65 +++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 21 files changed, 974 insertions(+), 29 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest.  Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing 0 to
+poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
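
To illustrate the requestor side described above, here is a schematic
sketch (not part of the patch: "MyState", its children array, and its
result-stashing fields are invented for illustration; only the
ExecAsync* calls are the patch's API, and the tuples are assumed to be
saved into node->results[] by the node's ExecAsyncResponse callback):

    /* Schematic requestor loop, modeled on ExecAppend in this series. */
    static TupleTableSlot *
    ExecMyNode(MyState *node)
    {
        EState     *estate = node->ps.state;
        int         i;

        /* Ask each async-capable child for a tuple. */
        for (i = 0; i < node->nasyncchildren; i++)
            ExecAsyncRequest(estate, &node->ps, i, node->children[i]);

        /*
         * Run the event loop until some child delivers a result; a timeout
         * of -1 blocks, 0 only polls.  Completed tuples arrive through the
         * ExecAsyncResponse callback, which stashes them on the node.
         */
        while (node->nresults == 0)
            ExecAsyncEventLoop(estate, &node->ps, -1);

        return node->results[--node->nresults];
    }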
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7337d21..4c1991c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -479,11 +479,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		nasync = estate->es_num_pending_async;
+
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (nasync >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[nasync] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[nasync] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[nasync];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async;
+
+	/* Initialize the new request. */
+	areq->state = ASYNCREQ_IDLE;
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
+
+	/* No result available now, make this node pending */
+	estate->es_num_pending_async++;
+
+	return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	Assert(estate->es_async_callback_pending == 0);
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Check for events only if any node is async-not-ready. */
+		if (estate->es_num_async_ready < estate->es_num_pending_async)
+		{
+			/* Don't block if any tuple available. */
+			if (estate->es_async_callback_pending > 0)
+				ExecAsyncEventWait(estate, 0);
+			else if (!ExecAsyncEventWait(estate, cur_timeout))
+			{	/* Not fired */
+				/* Exited before timeout. Calculate the remaining time. */
+				instr_time      cur_time;
+
+				/* Wait forever if no timeout was given */
+				if (timeout < 0)
+					continue;
+
+				/* Update the remaining time for the next wait, clamped at 0 */
+				INSTR_TIME_SET_CURRENT(cur_time);
+				INSTR_TIME_SUBTRACT(cur_time, start_time);
+				cur_timeout =
+					Max(timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time), 0);
+
+				if (cur_timeout > 0)
+					continue;
+			}
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
+			/* Notify if the requestee is ready */
+			if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+				ExecAsyncNotify(estate, areq);
+
+			/* Deliver the acquired tuple to the requestor */
+			if (areq->state == ASYNCREQ_COMPLETE)
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
+			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all not-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+				if (tail->state == ASYNCREQ_COMPLETE)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; on the contrary,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out, or true if any event fired or there is no
+ * event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+	bool	added = false;
+	bool	fired = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+		/*
+		 * The wait event set created here should live beyond the ExecutorState
+		 * context, but be released in case of error.
+		 */
+		estate->es_wait_event_set =
+			CreateWaitEventSet(TopTransactionContext,
+							   TopTransactionResourceOwner,
+							   estate->es_allocated_fd_events + 1);
+
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0 || areq->wants_process_latch)
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/*
+	 * We may have no event to wait for.  This occurs when all nodes that
+	 * are executing asynchronously have tuples immediately available.
+	 */
+	if (!added)
+		return true;
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			Assert(areq->state == ASYNCREQ_WAITING);
+
+			areq->state = ASYNCREQ_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
+			fired = true;
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(areq->state == ASYNCREQ_WAITING);
+				areq->state = ASYNCREQ_CALLBACK_PENDING;
+				estate->es_async_callback_pending++;
+				fired = true;
+			}
+		}
+	}
+
+	return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests can omit registering an event, but it is the
+ * responsibility of the node driver to set at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+
+	estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+	estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNCREQ_WAITING;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+		ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	Assert(areq->state == ASYNCREQ_IDLE ||
+		   areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->state = ASYNCREQ_COMPLETE;
+	estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+	if (estate->es_wait_event_set == NULL)
+		return;
+
+	FreeWaitEventSet(estate->es_wait_event_set);
+	estate->es_wait_event_set = NULL;
+}
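
For reference while reading execAsync.c above: a PendingAsyncRequest
moves through the following states (the state names are from the patch;
the enum name and transition notes here are only a reading aid, the real
declaration being in execnodes.h):

    /* Lifecycle of a PendingAsyncRequest, as driven by execAsync.c. */
    typedef enum AsyncRequestState
    {
        ASYNCREQ_IDLE,              /* just set up by ExecAsyncRequest */
        ASYNCREQ_WAITING,           /* ExecAsyncSetRequiredEvents registered events */
        ASYNCREQ_CALLBACK_PENDING,  /* an awaited event fired; notify is due */
        ASYNCREQ_COMPLETE           /* ExecAsyncRequestDone stored the result */
    } AsyncRequestState;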
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5469cde..2b727c0 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index aae5e3f..2c07095 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, all async subplans need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+
+		/*
+		 * XXX: Always clear registered events.  This seems a bit inefficient,
+		 * but the set of events to wait for changes almost randomly on
+		 * every call.
+		 */
+		ExecAsyncClearEvents(estate);
+
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			node->as_nasyncpending++;
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+		}
+
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
+		 */
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			while (node->as_nasyncpending > 0)
+			{
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+
+				/* Timeout reached.  Fall through to the sync nodes, if any exist. */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->state == ASYNCREQ_COMPLETE);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* This is no longer pending */
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another one here immediately because it might complete.
+	 */
+	node->as_needrequest =
+		bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9ae1561..7db5c30 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7811ad5..8cd0821 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,6 +242,8 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(partitioned_rels);
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 4949d58..2d50b8a 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -376,6 +376,8 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index e24f5d6..fae9396 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1579,6 +1579,8 @@ _readAppend(void)
 
 	READ_NODE_FIELD(partitioned_rels);
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 94beeb8..9c29787 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *appendplans, int nasyncplans,
+						   int referent, List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -282,7 +283,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
 
 /*
  * create_plan
@@ -1003,8 +1004,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1030,7 +1035,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * deparsing tlist entries (see set_deparse_planstate).  Since
+	 * async-capable children are moved to the head of the subplan list
+	 * below, record where the original first child ends up (the "referent")
+	 * so that deparsing still refers to it.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -1039,7 +1051,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1049,7 +1072,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist, best_path->partitioned_rels);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5268,17 +5293,23 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+			List *tlist, List *partitioned_rels)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
 
+	/* Async execution is not currently supported on partitioned tables */
+	Assert(nasyncplans == 0 || partitioned_rels == NIL);
+
 	plan->targetlist = tlist;
 	plan->qual = NIL;
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->partitioned_rels = partitioned_rels;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6608,3 +6639,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
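
For completeness, the FDW side of this handshake registers the
capability callback in its handler. A minimal hypothetical fragment
(names with a myfdw prefix are invented; the remaining async callbacks
are elided):

/* Hypothetical capability callback; a real FDW may inspect the path. */
static bool
myfdwIsForeignPathAsyncCapable(ForeignPath *path)
{
	return true;
}

Datum
myfdw_handler(PG_FUNCTION_ARGS)
{
	FdwRoutine *routine = makeNode(FdwRoutine);

	/* ... the usual scan and modify callbacks go here ... */

	/* Advertise async support so is_async_capable_path() returns true. */
	routine->IsForeignPathAsyncCapable = myfdwIsForeignPathAsyncCapable;

	PG_RETURN_POINTER(routine);
}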
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f453dad..97337bd 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3611,6 +3611,8 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 			break;
 		case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
 			event_name = "LogicalSyncStateChange";
+			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
 			break;
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 9234bc2..0ed6d2c 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4425,7 +4425,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int			idx = ((Append *) ((AppendState *) ps)->ps.plan)->referent;
+
+		dpns->outer_planstate = ((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 6fb4662..3cbf9ff 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 1b167b8..e4ba4a9 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif   /* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 6ca44f7..863ff0e 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index d33392f..b58c66e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -395,6 +395,32 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+	ASYNCREQ_IDLE,						/* Nothing is requested */
+	ASYNCREQ_WAITING,					/* Waiting for events */
+	ASYNCREQ_CALLBACK_PENDING,			/* Having events to be processed */
+	ASYNCREQ_COMPLETE					/* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	AsyncRequestState state;
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -486,6 +512,32 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_num_async_ready is the number of PendingAsyncRequests that are
+	 * ready to return a tuple.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;		/* # of nodes awaiting results */
+	int			es_max_pending_async;		/* allocated # of pending slots */
+	int			es_async_callback_pending;	/* # of callbacks yet to run */
+	int			es_num_async_ready;			/* # of tuple-ready nodes */
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -950,17 +1002,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
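
Putting the enum and the execAsync.h prototypes together, the intended
lifecycle of a request appears to be as follows. This is a sketch
inferred from the declarations, not executor code from the patch:

/*
 * Inferred lifecycle sketch (not patch code).  A requestor asks for a
 * tuple, then runs the event loop until the request completes:
 *
 *   ASYNCREQ_IDLE -> WAITING -> CALLBACK_PENDING -> COMPLETE
 */
static void
async_request_example(EState *estate, PlanState *append, PlanState *child)
{
	/* Kick off an asynchronous request; request_index 0 is arbitrary. */
	ExecAsyncRequest(estate, append, 0, child);

	/*
	 * Wait for events.  When a file descriptor fires, the requestee's
	 * notify hook runs and eventually calls ExecAsyncRequestDone(),
	 * which moves the request to ASYNCREQ_COMPLETE and hands the result
	 * back through the requestor's response callback.
	 */
	(void) ExecAsyncEventLoop(estate, append, -1 /* no timeout */);
}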
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index d84372d..8bace1f 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent;		/* index of the child used when deparsing */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 5e029c0..7537ce2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -812,7 +812,8 @@ typedef enum
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
-	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2
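
Before reading the postgres_fdw patch below, it may help to recall the
standard libpq non-blocking pattern it builds on. This is plain libpq
API shown for reference, with error handling abbreviated:

/*
 * Standard libpq non-blocking consumption pattern (reference only).
 * Returns false if the result is not ready yet; the caller should then
 * wait for PQsocket(conn) to become readable and retry.
 */
static bool
try_drain_connection(PGconn *conn)
{
	PGresult   *res;

	/* Pull whatever bytes have arrived; this never blocks. */
	if (!PQconsumeInput(conn))
		elog(ERROR, "%s", PQerrorMessage(conn));

	/* If the full result hasn't arrived, tell the caller to wait. */
	if (PQisBusy(conn))
		return false;

	/* The result is complete, so PQgetResult() will not block now. */
	while ((res = PQgetResult(conn)) != NULL)
	{
		if (PQresultStatus(res) != PGRES_TUPLES_OK)
			elog(ERROR, "unexpected result status: %s",
				 PQresultErrorMessage(res));
		PQclear(res);
	}
	return true;
}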

0003-Make-postgres_fdw-async-capable.patchtext/x-patch; charset=us-asciiDownload
From a902431be043ad0e930f03f77faa716ccb286360 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:49:41 +0900
Subject: [PATCH 3/4] Make postgres_fdw async-capable.

Make postgres_fdw async-capable using the new infrastructure.
Additionally, give each postgres_fdw connection a connection-specific
storage area so that foreign scans sharing a connection can exchange
state; postgres_fdw uses it to track which scan node is currently
running a query on the connection. This allows asynchronous execution
of multiple foreign scans on one foreign server.
---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 +++---
 contrib/postgres_fdw/postgres_fdw.c            | 522 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 591 insertions(+), 160 deletions(-)
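
The connection-specific storage mentioned above is keyed by user
mapping. A hypothetical caller (MyConnPriv is invented for
illustration) would use the new API like this:

/* Invented per-connection state for this sketch. */
typedef struct MyConnPriv
{
	ForeignScanState *current_owner;	/* scan now using the connection */
} MyConnPriv;

static MyConnPriv *
get_my_conn_priv(UserMapping *user)
{
	/*
	 * The storage is allocated in CacheMemoryContext and zero-filled on
	 * the first call for a given user mapping, so current_owner starts
	 * out NULL.
	 */
	return (MyConnPriv *)
		GetConnectionSpecificStorage(user, sizeof(MyConnPriv));
}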

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index c6e3d44..d8ded74 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -49,6 +49,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -64,6 +65,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -75,26 +77,12 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -122,11 +110,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -139,8 +124,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -177,6 +193,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Return the connection-specific storage for this user, allocating and
+ * zero-filling initsize bytes if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4d86ab5..c1c0320 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6414,7 +6414,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6442,7 +6442,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6470,7 +6470,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6498,7 +6498,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6564,35 +6564,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6602,35 +6607,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6660,11 +6670,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6678,11 +6688,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6713,16 +6723,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6740,16 +6750,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6900,27 +6910,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..6c8da30 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;	/* true if a tuple or EOF is ready */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* Last node in the waiting list.
+									 * Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -348,6 +380,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+							PendingAsyncRequest *areq,
+							bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -368,7 +408,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +481,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +516,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1322,12 +1372,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1383,32 +1442,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If any other node is waiting for this connection, let the
+			 * first waiter be its next owner.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Someone else is using this connection, so this node must run
+			 * later.  Add it to the tail of the waiters' list and return
+			 * not-ready.  To avoid scanning the whole list, the current
+			 * owner maintains a shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this point no query is in flight on this connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
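
The waiter bookkeeping in the hunk above amounts to a FIFO queue: a
singly linked list in which only the current connection owner keeps a
pointer to the tail, so both enqueueing and hand-off are O(1). A
stripped-down model of the same discipline (illustrative only, not
patch code):

/*
 * Simplified model of the waiter queue.  Only the head (the connection
 * owner) keeps a valid tail pointer; by convention a node with no
 * waiters has tail pointing to itself.
 */
typedef struct Waiter
{
	struct Waiter *next;		/* next node waiting for the connection */
	struct Waiter *tail;		/* valid only on the current owner */
} Waiter;

static void
enqueue_waiter(Waiter *owner, Waiter *newcomer)
{
	owner->tail->next = newcomer;	/* append after the current tail */
	owner->tail = newcomer;			/* the owner tracks the new tail */
}

static Waiter *
hand_off(Waiter *owner)
{
	Waiter	   *next_owner = owner->next;

	if (next_owner)
	{
		next_owner->tail = owner->tail; /* tail duty moves to new owner */
		owner->next = NULL;
		owner->tail = owner;	/* back to the "no waiters" convention */
	}
	return next_owner;
}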
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and cleanup garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1699,7 +1875,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1778,6 +1956,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1788,14 +1968,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1803,10 +1983,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1844,6 +2024,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1864,14 +2046,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1879,10 +2061,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1920,6 +2102,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1940,14 +2124,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1955,10 +2139,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2005,16 +2189,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2302,7 +2486,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2355,7 +2541,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2402,8 +2591,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2522,6 +2711,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2564,6 +2754,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2918,11 +3118,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2988,47 +3188,96 @@ create_cursor(ForeignScanState *node)
- * Fetch some more rows from the node's cursor.
+ * Send a request for more rows from the node's cursor, without waiting
+ * for the result.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows returned by a FETCH previously sent by
+ * request_more_data(), appending them to the current batch.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3038,27 +3287,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3142,7 +3446,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3152,12 +3456,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3165,9 +3469,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3298,9 +3602,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3308,10 +3612,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4582,6 +4886,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * Accept async request. Notify to the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finishing the returning
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
+	slot = ExecForeignScan(node);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure waiting event.
+ *
+ * Add an wait event only when the node is the connection owner. Elsewise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+						   bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Process a notification from async mechanism. ExecForeignScan does
+ * additional work to complete the returning tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4946,7 +5324,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 25c950d..6dd136c 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 509bb54..3370778 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1488,25 +1488,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1542,12 +1542,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1606,8 +1606,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From 000f0465a59cdabd02f43e886c76c89c14d987a5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index d1cc38b..1c34114 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 53e6bf2..8c182a2 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index 5afb211..1d9111e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  &MyProc->procLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index af46d78..a1a1121 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+	 * so use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+	 * so use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 3158d7b..8233b6d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 411d08f..0c6979a 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif   /* RESOWNER_PRIVATE_H */
-- 
2.9.2

0004-Apply-unlikely-to-suggest-synchronous-route-of.patchtext/x-patch; charset=us-asciiDownload
From a02948883a160953ed2fac65c15c266d52f2163d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:50:26 +0900
Subject: [PATCH 4/4] Apply unlikely to suggest synchronous route of 
 ExecAppend.

ExecAppend seems to get slowed down by the branch-misprediction
penalty of branches related to async execution. Apply unlikely() to
them to keep that penalty off the existing synchronous route.
Asynchronous execution already carries a lot of additional code, so
this doesn't add significant degradation there.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 2c07095..43e777f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

#41Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#40)
4 attachment(s)

The patch no longer applied cleanly. This is a new version, simply
rebased onto the current master. Further amendments will follow later.

The attached patch is rebased on the current master, with no
substantial changes other than disallowing partitioned tables for
async execution by assertion.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From 32d5c143a679bcccee9ff29fe3807dfd8deae458 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/4] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 261e9be..c4f336d 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -201,7 +201,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..9543397 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index b4b7d28..182f759 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  MyLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 4a4a287..f2509c3 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+	 * so use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+	 * so use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfaf..392c1d6 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 2420b65..70b0bb9 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif							/* RESOWNER_PRIVATE_H */
-- 
2.9.2

0002-Asynchronous-execution-framework.patchtext/x-patch; charset=us-asciiDownload
From 1bb440d25eddcbfeff8d3f032432edca15e43477 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/4] Asynchronous execution framework

This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from
underlying nodes asynchronously. This is a different mechanism from
parallel execution: while parallel execution is analogous to threads,
this framework is analogous to select(2), which handles multiple
inputs in a single backend process. To avoid degrading non-async
execution, this framework uses a completely separate channel to
convey tuples. You will find the details of the API at the end of
src/backend/executor/README.
---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/README             |  45 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 520 ++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c     |   1 +
 src/backend/executor/instrument.c       |   2 +-
 src/backend/executor/nodeAppend.c       | 169 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 +++
 src/backend/nodes/copyfuncs.c           |   2 +
 src/backend/nodes/outfuncs.c            |   2 +
 src/backend/nodes/readfuncs.c           |   2 +
 src/backend/optimizer/plan/createplan.c |  69 ++++-
 src/backend/postmaster/pgstat.c         |   2 +
 src/backend/utils/adt/ruleutils.c       |   6 +-
 src/include/executor/execAsync.h        |  30 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  17 ++
 src/include/nodes/execnodes.h           |  65 +++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 21 files changed, 974 insertions(+), 29 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest.  Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, even passing -1 to
+poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can instead use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7337d21..4c1991c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -479,11 +479,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		nasync = estate->es_num_pending_async;
+
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (nasync >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[nasync] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[nasync] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[nasync];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async;
+
+	/* Initialize the new request. */
+	areq->state = ASYNCREQ_IDLE;
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
+
+	/* No result available now, make this node pending */
+	estate->es_num_pending_async++;
+
+	return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	Assert(estate->es_async_callback_pending == 0);
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Check for events only if any node is async-not-ready. */
+		if (estate->es_num_async_ready < estate->es_num_pending_async)
+		{
+			/* Don't block if any tuple available. */
+			if (estate->es_async_callback_pending > 0)
+				ExecAsyncEventWait(estate, 0);
+			else if (!ExecAsyncEventWait(estate, cur_timeout))
+			{	/* No event fired */
+				/* The wait may have ended early; recalculate the remaining time. */
+				instr_time      cur_time;
+
+				/* A negative timeout means wait forever */
+				if (timeout < 0)
+					continue;
+
+				INSTR_TIME_SET_CURRENT(cur_time);
+				INSTR_TIME_SUBTRACT(cur_time, start_time);
+				cur_timeout =
+					timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+
+				if (cur_timeout > 0)
+					continue;
+
+				/* The timeout has expired; don't wait again. */
+				cur_timeout = 0;
+			}
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
+			/* Notify if the requestee is ready */
+			if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+				ExecAsyncNotify(estate, areq);
+
+			/* Deliver the acquired tuple to the requester */
+			if (areq->state == ASYNCREQ_COMPLETE)
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
+			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all non-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+				if (tail->state == ASYNCREQ_COMPLETE)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; instead,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns true if any event fired or there was nothing to wait for; returns
+ * false if we timed out.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+	bool	added = false;
+	bool	fired = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+		/*
+		 * The wait event set created here should live beyond the ExecutorState
+		 * context, but must still be released in case of error.
+		 */
+		estate->es_wait_event_set =
+			CreateWaitEventSet(TopTransactionContext,
+							   TopTransactionResourceOwner,
+							   estate->es_allocated_fd_events + 1);
+
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0 || areq->wants_process_latch)
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/*
+	 * We may have no events to wait for. This occurs when all asynchronously
+	 * executing nodes have tuples immediately available.
+	 */
+	if (!added)
+		return true;
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			Assert(areq->state == ASYNCREQ_WAITING);
+
+			areq->state = ASYNCREQ_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
+			fired = true;
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(areq->state == ASYNCREQ_WAITING);
+				areq->state = ASYNCREQ_CALLBACK_PENDING;
+				estate->es_async_callback_pending++;
+				fired = true;
+			}
+		}
+	}
+
+	return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * Individual requests may omit registering an event, but it is the
+ * responsibility of the node driver to register at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+
+	estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+	estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNCREQ_WAITING;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+		ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a process that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	Assert(areq->state == ASYNCREQ_IDLE ||
+		   areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->state = ASYNCREQ_COMPLETE;
+	estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+	if (estate->es_wait_event_set == NULL)
+		return;
+
+	FreeWaitEventSet(estate->es_wait_event_set);
+	estate->es_wait_event_set = NULL;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 294ad2c..8f8ad2c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index aae5e3f..2c07095 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, every async subplan needs a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		/*
+		 * XXXX: Always clear the registered events.  This seems a bit
+		 * inefficient, but the set of events to wait for changes almost
+		 * randomly from one call to the next.
+		 */
+		ExecAsyncClearEvents(estate);
+
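+		/* Issue a new request to every async subplan that needs one. */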
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			node->as_nasyncpending++;
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+		}
+
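+		/* If nothing is pending and the sync subplans are done, so are we. */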
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
+		 */
+		if (node->as_nasyncpending > 0)
+		{
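+			/*
+			 * If the synchronous subplans are all done, block until some
+			 * asynchronous result arrives; otherwise just poll so that the
+			 * synchronous subplans can make progress.
+			 */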
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			while (node->as_nasyncpending > 0)
+			{
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+
+				/* Timeout reached; fall through to the sync subplans, if any */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->state == ASYNCREQ_COMPLETE);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* This is no longer pending */
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch another request here immediately, because it might
+	 * complete before the result we just saved has been consumed.
+	 */
+	node->as_needrequest =
+		bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9cde112..1df8ccb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 67ac814..7e5bb38 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,6 +242,8 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(partitioned_rels);
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 3a23f0b..030ed8e 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -376,6 +376,8 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2988e8b..0615d52 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1579,6 +1579,8 @@ _readAppend(void)
 
 	READ_NODE_FIELD(partitioned_rels);
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e589d92..c341805 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *appendplans, int nasyncplans,
+						   int referent, List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -282,7 +283,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
 
 /*
  * create_plan
@@ -1003,8 +1004,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1030,7 +1035,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * explaining tlist entries (see set_deparse_planstate), so we must keep
+	 * track of where the first child of best_path->subpaths ends up in the
+	 * reordered subplan list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -1039,7 +1051,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1049,7 +1072,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist, best_path->partitioned_rels);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5269,17 +5294,23 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+			List *tlist, List *partitioned_rels)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
 
+	/* Async execution is not currently supported on partitioned tables */
+	Assert(nasyncplans == 0 || partitioned_rels == NIL);
+
 	plan->targetlist = tlist;
 	plan->qual = NIL;
 	plan->lefttree = NULL;
 	plan->righttree = NULL;
 	node->partitioned_rels = partitioned_rels;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6609,3 +6640,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+				break;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 65b7b32..25c84bc 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3611,6 +3611,8 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 			break;
 		case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
 			event_name = "LogicalSyncStateChange";
+			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
 			break;
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 18d9e27..c7e69cb 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4425,7 +4425,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int			idx = ((Append *) ps->plan)->referent;
+
+		dpns->outer_planstate = ((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index ee0b6ad..d8c3e31 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif							/* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3ff4ecd..e6ba392 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif							/* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e391f20..57876d1 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,16 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 													RelOptInfo *rel,
 													RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -225,6 +235,13 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 54c5cf5..225cb1e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -415,6 +415,32 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+	ASYNCREQ_IDLE,						/* Nothing is requested */
+	ASYNCREQ_WAITING,					/* Waiting for events */
+	ASYNCREQ_CALLBACK_PENDING,			/* Events fired; callback not yet run */
+	ASYNCREQ_COMPLETE					/* Result is available */
+} AsyncRequestState;
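+
+/*
+ * Typical lifecycle: IDLE until a request is issued; WAITING while the
+ * requestee has wait events registered; CALLBACK_PENDING once an event
+ * fires; COMPLETE when ExecAsyncRequestDone() saves the result.  A request
+ * that is satisfied immediately goes straight from IDLE to COMPLETE.
+ */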
+
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	AsyncRequestState state;
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -506,6 +532,32 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently
+	 * valid.  (Entries after that may point to storage that can be reused.)
+	 * es_async_callback_pending is the number of requests whose callbacks
+	 * have yet to be run, and es_num_async_ready is the number of
+	 * PendingAsyncRequests that are ready to return a tuple.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;		/* # of nodes to wait */
+	int			es_max_pending_async;		/* max # of pending nodes */
+	int			es_async_callback_pending;	/* # of nodes to callback */
+	int			es_num_async_ready;			/* # of tuple-ready nodes */
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -967,17 +1019,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f1a1b24..5abff26 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6bffe63..fb6d02a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -812,7 +812,8 @@ typedef enum
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
-	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2
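
For FDW authors, adopting the new API is mostly a matter of implementing
the callbacks and wiring them into the handler function.  A minimal sketch
(the my* names are hypothetical placeholders; the field names match the
FdwRoutine additions above, and the postgres_fdw patch below does exactly
this):

	Datum
	my_fdw_handler(PG_FUNCTION_ARGS)
	{
		FdwRoutine *routine = makeNode(FdwRoutine);

		/* ... the usual scan and modify callbacks ... */

		/* Support functions for asynchronous execution */
		routine->IsForeignPathAsyncCapable = myIsForeignPathAsyncCapable;
		routine->ForeignAsyncRequest = myForeignAsyncRequest;
		routine->ForeignAsyncConfigureWait = myForeignAsyncConfigureWait;
		routine->ForeignAsyncNotify = myForeignAsyncNotify;

		PG_RETURN_POINTER(routine);
	}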

0003-Make-postgres_fdw-async-capable.patchtext/x-patch; charset=us-asciiDownload
From 0b279ad32ea441580ead8056c855119c3d871aca Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/4] Make postgres_fdw async-capable.

Make postgres_fdw async-capable using the new infrastructure.
Additionally, give each postgres_fdw connection a connection-specific
storage area so that foreign scans on the same connection can share
some data; postgres_fdw uses it to track the scan node currently
running a query on the underlying connection. This allows asynchronous
execution of multiple foreign scans on one foreign server.
---
 contrib/postgres_fdw/connection.c              |  79 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out | 144 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 522 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  12 +-
 5 files changed, 595 insertions(+), 164 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 8c33dea..0b1af3b 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -53,6 +53,7 @@ typedef struct ConnCacheEntry
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
 	bool		changing_xact_state;	/* xact state change in process */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -68,6 +69,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -85,26 +87,12 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
 static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
 						 PGresult **result);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -132,11 +120,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -150,11 +135,42 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
 		entry->changing_xact_state = false;
+		entry->storage = NULL;
 	}
 
 	/* Reject further use of connections which failed abort cleanup. */
 	pgfdw_reject_incomplete_xact_state_change(entry);
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -191,6 +207,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Return the connection-specific storage for this user, allocating it
+ * with the given initial size if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
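+
+/*
+ * Typical use (see postgresBeginForeignScan in postgres_fdw.c):
+ *
+ *	fsstate->s.connpriv = (PgFdwConnpriv *)
+ *		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ */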
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index b112c19..7401304 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6417,12 +6417,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- a        | aaa
- a        | aaaa
- a        | aaaaa
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | aaaa
+ a        | aaaaa
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6445,12 +6445,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6473,12 +6473,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | new
  b        | new
  b        | new
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6501,12 +6501,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | newtoo
- a        | newtoo
- a        | newtoo
  b        | newtoo
  b        | newtoo
  b        | newtoo
+ a        | newtoo
+ a        | newtoo
+ a        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6564,35 +6564,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6602,35 +6607,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6660,11 +6670,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6678,11 +6688,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6713,16 +6723,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6740,16 +6750,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6900,27 +6910,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 7214666..b09a099 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
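+
+/*
+ * Only one scan can have a query in flight on a given connection at a time;
+ * that scan is the connection's current owner.  Other scans wanting to use
+ * the connection queue up behind it via the waiter/last_waiter links in
+ * PgFdwScanState.
+ */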
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;	/* true if a tuple or EOF is ready to return */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* Waiting node at the end of the
+									 * waiting list.  Maintained only by
+									 * the current owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -348,6 +380,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+							PendingAsyncRequest *areq,
+							bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -368,7 +408,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +481,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +516,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1322,12 +1372,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1383,32 +1442,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If any other node is waiting behind this one on the same
+			 * connection, make the first waiter the connection's next owner.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Someone else is holding this connection and we want this node
+			 * to run later.  Add this node to the tail of the waiters' list
+			 * and return not-ready.  To avoid scanning through the waiters'
+			 * list, the current owner maintains a shortcut to the last
+			 * waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node in the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this time no node is running on the connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the next owner in the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+		/*
+		 * If we still haven't received a result for this node, return no
+		 * tuple so as to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1699,7 +1875,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1778,6 +1956,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1788,14 +1968,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1803,10 +1983,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1844,6 +2024,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1864,14 +2046,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1879,10 +2061,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1920,6 +2102,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1940,14 +2124,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1955,10 +2139,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2005,16 +2189,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2302,7 +2486,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2355,7 +2541,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2402,8 +2591,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2522,6 +2711,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2564,6 +2754,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2918,11 +3118,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2988,47 +3188,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Collect the rows already received for the node's running query.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuple is remaining
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3038,27 +3287,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * let the current connection owner read the result for the running query
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3142,7 +3446,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3152,12 +3456,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3165,9 +3469,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3298,9 +3602,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3308,10 +3612,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4582,6 +4886,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * Accept an async request.  Notify the caller if the next tuple is
+ * immediately available.  ExecForeignScan does additional work to finish
+ * the returned tuple, so call it instead of postgresIterateForeignScan to
+ * acquire a tuple in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
+	slot = ExecForeignScan(node);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when the node is the connection owner.  Otherwise,
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+						   bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Process a notification from the async mechanism.  ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4946,7 +5324,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index f396dae..a67da3d 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 509bb54..1f69908 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1542,12 +1542,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1606,8 +1606,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patchtext/x-patch; charset=us-asciiDownload
From cfc22e4a0cf8597ef13b82c6e177ce90a2444d78 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/4] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them to avoid
that penalty on the existing synchronous route. Asynchronous execution
already adds a lot of code, so this doesn't add significant
degradation.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 2c07095..43e777f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

#42Antonin Houska
ah@cybertec.at
In reply to: Kyotaro HORIGUCHI (#41)
1 attachment(s)

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The patch had conflicts. This is a new version just rebased to
the current master. Further amendments will come later.

Can you please explain this part of make_append()?

/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);

I don't think the output of an Append plan is supposed to be ordered even if
the underlying relation is partitioned. Besides ordering, is there any other
reason not to use asynchronous execution?

And even if there were one, the planner should ensure that the executor does
not fire the assertion above. The attached script shows an example of how to
cause the assertion failure.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at

Attachments:

async_append_test.shtext/plainDownload
#43Antonin Houska
ah@cybertec.at
In reply to: Kyotaro HORIGUCHI (#41)

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The patch had conflicts. This is a new version just rebased to
the current master. Further amendments will come later.

Just one idea that I had while reading the code.

In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
complete requests to the end and finally adjust estate->es_num_pending_async
so that the array no longer contains the complete requests. I think the point
is that you can then add new requests to the end of the array.

I wonder if a set (Bitmapset) of incomplete requests would make the code more
efficient. The set would contain the position of each incomplete request in
estate->es_pending_async (I think it's the myindex field of
PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
requests subject to ExecAsyncNotify etc., then the compaction of
estate->es_pending_async wouldn't be necessary.

ExecAsyncRequest would use the set to look for space for new requests by
iterating it and trying to find the first gap (which corresponds to a
completed request).

And finally, an item would be removed from the set at the moment the request
state is set to ASYNCREQ_COMPLETE.
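
For illustration, a rough sketch of what I have in mind (es_incomplete_async
is an invented EState field holding the set; the bms_* routines are the ones
in nodes/bitmapset.h):

	int		i;
	int		slot = -1;

	/* In ExecAsyncRequest: reuse the slot of a completed request, if any. */
	for (i = 0; i < estate->es_num_pending_async; ++i)
	{
		if (!bms_is_member(i, estate->es_incomplete_async))
		{
			slot = i;			/* first gap: a completed request */
			break;
		}
	}
	if (slot < 0)
		slot = estate->es_num_pending_async++;	/* no gap: append */
	estate->es_incomplete_async =
		bms_add_member(estate->es_incomplete_async, slot);

	/* In ExecAsyncEventLoop: visit only the incomplete requests. */
	i = -1;
	while ((i = bms_next_member(estate->es_incomplete_async, i)) >= 0)
	{
		PendingAsyncRequest *areq = estate->es_pending_async[i];

		/* ... ExecAsyncNotify etc.; once complete, drop it from the set: */
		if (areq->state == ASYNCREQ_COMPLETE)
			estate->es_incomplete_async =
				bms_del_member(estate->es_incomplete_async, i);
	}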

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


#44Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Antonin Houska (#42)

Thank you for looking at this.

At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska <ah@cybertec.at> wrote in <4579.1498638234@localhost>

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The patch had conflicts. This is a new version just rebased to
the current master. Further amendments will come later.

Can you please explain this part of make_append()?

/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);

I don't think the output of an Append plan is supposed to be ordered even if
the underlying relation is partitioned. Besides ordering, is there any other
reason not to use asynchronous execution?

It was just a developmental sentinel to remind me to consider declarative
partitions later, since I didn't have a clear idea of the differences (or
similarities) between appendrels and partitioned_rels. That is not to say
the condition can never arise. I'll check it out and add support for
partitioned_rels soon. Sorry for having left it as it was.

And even if there were one, the planner should ensure that the executor does
not fire the assertion above. The attached script shows an example of how to
cause the assertion failure.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#45Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#44)

Hi,

On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:

Thank you for looking at this.

At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The patch got conflicted. This is a new version just rebased to
the current master. Furtuer amendment will be taken later.

Can you please explain this part of make_append()?

/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);

I don't think the output of an Append plan is supposed to be ordered even if
the underlying relation is partitioned. Besides ordering, is there any other
reason not to use asynchronous execution?

It was just a developmental sentinel to remind me to consider declarative
partitions later, since I didn't have a clear idea of the differences (or
similarities) between appendrels and partitioned_rels. That is not to say
the condition can never arise. I'll check it out and add support for
partitioned_rels soon. Sorry for having left it as it was.

When making an Append for a partitioned table, among the arguments passed
to make_append(), 'partitioned_rels' is a list of RT indexes of
partitioned tables in the inheritance tree of which the aforementioned
partitioned table is the root. 'appendplans' is a list of subplans for
scanning the leaf partitions in the tree. Note that the 'appendplans'
list contains no members corresponding to the partitioned tables, because
we don't need to scan them (only leaf relations contain any data).

The point of having the 'partitioned_rels' list in the resulting Append
plan is so that the executor can identify those relations and take the
appropriate locks on them.

Thanks,
Amit


#46Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Amit Langote (#45)

Hi, I've returned.

At Thu, 29 Jun 2017 14:08:27 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in <63a5a01c-2967-83e0-8bbf-c981404f529e@lab.ntt.co.jp>

Hi,

On 2017/06/29 13:45, Kyotaro HORIGUCHI wrote:

Thank you for looking at this.

At Wed, 28 Jun 2017 10:23:54 +0200, Antonin Houska wrote:

Can you please explain this part of make_append()?

/* Currently async on partitioned tables is not available */
Assert(nasyncplans == 0 || partitioned_rels == NIL);

I don't think the output of an Append plan is supposed to be ordered even if
the underlying relation is partitioned. Besides ordering, is there any other
reason not to use asynchronous execution?

When making an Append for a partitioned table, among the arguments passed
to make_append(), 'partitioned_rels' is a list of RT indexes of
partitioned tables in the inheritance tree of which the aforementioned
partitioned table is the root. 'appendplans' is a list of subplans for
scanning the leaf partitions in the tree. Note that the 'appendplans'
list contains no members corresponding to the partitioned tables, because
we don't need to scan them (only leaf relations contain any data).

The point of having the 'partitioned_rels' list in the resulting Append
plan is so that the executor can identify those relations and take the
appropriate locks on them.

Amit, thank you for the detailed explanation. I now understand what
the list is for and that simply ignoring it here is enough, and I
confirmed that this actually works as before.
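
For reference, the change is presumably nothing more than dropping the
developmental sentinel from make_append(), roughly:

-	/* Currently async on partitioned tables is not available */
-	Assert(nasyncplans == 0 || partitioned_rels == NIL);

since partitioned_rels only carries RT indexes for the executor to lock
and contributes no subplans that could be asynchronous.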

I'll address Antonin's comments tomorrow.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#47Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Antonin Houska (#43)

Thank you for the thought.

This is at the PoC level, so I'm grateful for this kind of
fundamental comment.

At Wed, 28 Jun 2017 20:22:24 +0200, Antonin Houska <ah@cybertec.at> wrote in <392.1498674144@localhost>

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

The patch had conflicts. This is a new version just rebased to
the current master. Further amendments will come later.

Just one idea that I had while reading the code.

In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
complete requests to the end and finally adjust estate->es_num_pending_async
so that the array no longer contains the complete requests. I think the point
is that you can then add new requests to the end of the array.

I wonder if a set (Bitmapset) of incomplete requests would make the code more
efficient. The set would contain the position of each incomplete request in
estate->es_pending_async (I think it's the myindex field of
PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
requests subject to ExecAsyncNotify etc., then the compaction of
estate->es_pending_async wouldn't be necessary.

ExecAsyncRequest would use the set to look for space for new requests by
iterating it and trying to find the first gap (which corresponds to a
completed request).

And finally, an item would be removed from the set at the moment the request
state is set to ASYNCREQ_COMPLETE.

Effectively it is a waiting queue followed by a completed list. The
point of the compaction is to keep the order of the waiting or
not-yet-completed requests, which is crucial to avoid a kind of
precedence inversion. We cannot keep that order using a bitmapset in
that way.

The current code waits for all waiters at once and processes all
fired events at once, so the order in the waiting queue is
inessential in that case. On the other hand, I suppose that waiting
on several tens to nearly a hundred remote hosts is a realistic
target range, and keeping the order could become crucial if we then
process only part of the queue at once.

If we put weight on the variation in response times among remotes,
processing everything at once is effective. In turn, we should
consider the lifecycle cost of the larger wait event set.

Sorry for the discursive discussion, but in short, I have noticed
that I have a lot to consider here :p Thanks!

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#48Antonin Houska
ah@cybertec.at
In reply to: Kyotaro HORIGUCHI (#47)

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Just one idea that I had while reading the code.

In ExecAsyncEventLoop you iterate estate->es_pending_async, then move the
complete requests to the end and finally adjust estate->es_num_pending_async
so that the array no longer contains the complete requests. I think the point
is that you can then add new requests to the end of the array.

I wonder if a set (Bitmapset) of incomplete requests would make the code more
efficient. The set would contain the position of each incomplete request in
estate->es_pending_async (I think it's the myindex field of
PendingAsyncRequest). If ExecAsyncEventLoop used this set to retrieve the
requests subject to ExecAsyncNotify etc., then the compaction of
estate->es_pending_async wouldn't be necessary.

ExecAsyncRequest would use the set to look for space for new requests by
iterating it and trying to find the first gap (which corresponds to a
completed request).

And finally, an item would be removed from the set at the moment the request
state is set to ASYNCREQ_COMPLETE.

Effectively it is a waiting queue followed by a completed list. The
point of the compaction is to keep the order of the waiting or
not-yet-completed requests, which is crucial to avoid a kind of
precedence inversion. We cannot keep that order using a bitmapset in
that way.

The current code waits for all waiters at once and processes all
fired events at once, so the order in the waiting queue is
inessential in that case. On the other hand, I suppose that waiting
on several tens to nearly a hundred remote hosts is a realistic
target range, and keeping the order could become crucial if we then
process only part of the queue at once.

If we put weight on the variation in response times among remotes,
processing everything at once is effective. In turn, we should
consider the lifecycle cost of the larger wait event set.

ok, I missed the fact that the order of es_pending_async entries is
important. I think this is worth adding a comment.
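
Maybe something along these lines, wherever es_pending_async is declared
(the wording is only a suggestion):

	/*
	 * es_pending_async is kept in arrival order: the leading entries are
	 * the waiting or not-yet-completed requests, and compaction must
	 * preserve their relative order so that an earlier request is never
	 * serviced after a later one (a kind of precedence inversion).
	 */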

Actually, the reason I thought of the simplification was that I noticed a
small inefficiency in the way you do the compaction. In particular, I think
it's not always necessary to swap the tail and head entries. Would something
like this make sense?

	/* If any node completed, compact the array. */
	if (any_node_done)
	{
		int		hidx = 0,
				tidx;

		/*
		 * Swap all not-yet-completed items to the start of the array.
		 * Keep them in the same order.
		 */
		for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
		{
			PendingAsyncRequest *tail = estate->es_pending_async[tidx];

			Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);

			if (tail->state == ASYNCREQ_COMPLETE)
				continue;

			/*
			 * If the array starts with one or more incomplete requests,
			 * both head and tail point at the same item, so there's no
			 * point in swapping.
			 */
			if (tidx > hidx)
			{
				PendingAsyncRequest *head = estate->es_pending_async[hidx];

				/*
				 * Once the tail got ahead, it should only leave
				 * ASYNCREQ_COMPLETE behind.  Only those can then be seen
				 * by head.
				 */
				Assert(head->state == ASYNCREQ_COMPLETE);

				estate->es_pending_async[tidx] = head;
				estate->es_pending_async[hidx] = tail;
			}

			++hidx;
		}

		estate->es_num_pending_async = hidx;
	}

And besides that, I think it'd be more intuitive if the meanings of "head" and
"tail" were reversed: if the array is iterated from lower to higher positions,
then I'd consider the head to be at the higher position, not the tail.

--
Antonin Houska
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de, http://www.cybertec.at


#49Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Antonin Houska (#48)

Hello,

At Tue, 11 Jul 2017 10:28:51 +0200, Antonin Houska <ah@cybertec.at> wrote in <6448.1499761731@localhost>

Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Effectively it is a waiting queue followed by a completed list. The
point of the compaction is to keep the order of the waiting or
not-yet-completed requests, which is crucial to avoid a kind of
precedence inversion. We cannot keep that order using a bitmapset in
that way.

The current code waits for all waiters at once and processes all
fired events at once, so the order in the waiting queue is
inessential in that case. On the other hand, I suppose that waiting
on several tens to nearly a hundred remote hosts is a realistic
target range, and keeping the order could become crucial if we then
process only part of the queue at once.

If we put weight on the variation in response times among remotes,
processing everything at once is effective. In turn, we should
consider the lifecycle cost of the larger wait event set.

ok, I missed the fact that the order of es_pending_async entries is
important. I think this is worth adding a comment.

I'll put an upper limit on the number of waiters processed at
once, then add a comment like that.
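
(Roughly like this, say; the name and value are only placeholders:)

| #define ASYNC_MAX_PROCESS_AT_ONCE 64
|
| /* Process at most this many waiters per round; the array order
|  * still guarantees that earlier requests are serviced first. */
| n = Min(estate->es_num_pending_async, ASYNC_MAX_PROCESS_AT_ONCE);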

Actually, the reason I thought of the simplification was that I noticed a
small inefficiency in the way you do the compaction. In particular, I think
it's not always necessary to swap the tail and head entries. Would something
like this make sense?

I'm not sure, but I suppose it is rare for all of the first several
elements in the array to be incomplete. In most cases the first
element gets a response first.

	/* If any node completed, compact the array. */
	if (any_node_done)
	{

	...

		for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
		{

	...

			if (tail->state == ASYNCREQ_COMPLETE)
				continue;

			/*
			 * If the array starts with one or more incomplete requests,
			 * both head and tail point at the same item, so there's no
			 * point in swapping.
			 */
			if (tidx > hidx)
			{

This works to skip the first several elements when all of them
are ASYNCREQ_COMPLETE. I think it makes sense as long as it
doesn't harm the loop. The optimization is more effective when
hoisted out of the loop, like this:

| for (tidx = 0; tidx < estate->es_num_pending_async &&
|      estate->es_pending_async[tidx]->state == ASYNCREQ_COMPLETE; ++tidx)
| 	;
| for (; tidx < estate->es_num_pending_async; ++tidx)
...

And besides that, I think it'd be more intuitive if the meanings of "head" and
"tail" were reversed: if the array is iterated from lower to higher positions,
then I'd consider the head to be at the higher position, not the tail.

Yeah, but maybe "head" is still confusing even if reversed, because it
is still not the head of anything. It might be less confusing to
rewrite it in a more verbose but straightforward way.

| int npending = 0;
|
| /* Skip over not-yet-completed items at the beginning */
| while (npending < estate->es_num_pending_async &&
|        estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
| 	npending++;
|
| /* Scan the rest for not-yet-completed items */
| for (i = npending + 1; i < estate->es_num_pending_async; ++i)
| {
| 	PendingAsyncRequest *tmp;
| 	PendingAsyncRequest *curr = estate->es_pending_async[i];
|
| 	if (curr->state == ASYNCREQ_COMPLETE)
| 		continue;
|
| 	/* Swap the not-yet-completed item to the tail of the first chunk */
| 	tmp = estate->es_pending_async[npending];
| 	estate->es_pending_async[npending] = curr;
| 	estate->es_pending_async[i] = tmp;
| 	++npending;
| }

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#50Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#49)
5 attachment(s)

Hello,

Commit 8bf58c0d9bd33686 badly conflicts with this patch, so I've
rebased it and added a patch to refactor the function that Antonin
pointed out. That patch will eventually be merged into the 0002 patch.

At Tue, 18 Jul 2017 16:24:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170718.162452.221576658.horiguchi.kyotaro@lab.ntt.co.jp>

I'll put an upper limit on the number of waiters processed at
once, then add a comment like that.

Actually, the reason I thought of the simplification was that I noticed a
small inefficiency in the way you do the compaction. In particular, I think
it's not always necessary to swap the tail and head entries. Would something
like this make sense?

I'm not sure, but I suppose it is rare for all of the first several
elements in the array to be incomplete. In most cases the first
element gets a response first.

...

Yeah, but maybe "head" is still confusing even if reversed, because it
is still not the head of anything. It might be less confusing to
rewrite it in a more verbose but straightforward way.

| int npending = 0;
|
| /* Skip over not-yet-completed items at the beginning */
| while (npending < estate->es_num_pending_async &&
|        estate->es_pending_async[npending]->state != ASYNCREQ_COMPLETE)
| 	npending++;
|
| /* Scan the rest for not-yet-completed items */
| for (i = npending + 1; i < estate->es_num_pending_async; ++i)
| {
| 	PendingAsyncRequest *tmp;
| 	PendingAsyncRequest *curr = estate->es_pending_async[i];
|
| 	if (curr->state == ASYNCREQ_COMPLETE)
| 		continue;
|
| 	/* Swap the not-yet-completed item to the tail of the first chunk */
| 	tmp = estate->es_pending_async[npending];
| 	estate->es_pending_async[npending] = curr;
| 	estate->es_pending_async[i] = tmp;
| 	++npending;
| }

The last patch does something like this (with the apparent bugs
fixed).

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From 41ad9a7518c066da619363e6cdf8574fa00ee1e5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Wed, 22 Feb 2017 09:07:49 +0900
Subject: [PATCH 1/5] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner to WaitEventSet and allows the
creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 4452ea4..ed71e7c 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364..9543397 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -518,12 +521,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -592,6 +598,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -633,6 +644,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index b4b7d28..182f759 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  MyLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 4a4a287..f2509c3 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfaf..392c1d6 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -152,7 +153,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 2420b65..70b0bb9 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif							/* RESOWNER_PRIVATE_H */
-- 
2.9.2

0002-Asynchronous-execution-framework.patchtext/x-patch; charset=us-asciiDownload
From afb9353f48dca75c6ab4d6db7a1378d61059e78c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 12:20:31 +0900
Subject: [PATCH 2/5] Asynchronous execution framework

This is a framework for asynchronous execution based on Robert Haas's
proposal. With it, any executor node can receive tuples from underlying
nodes asynchronously. This is a different mechanism from parallel
execution: while parallel execution is analogous to threads, this
framework is analogous to select(2), handling multiple inputs in a
single backend process. To avoid degrading non-async execution, the
framework uses a completely separate channel to convey tuples. The
details of the API are described at the end of
src/backend/executor/README.
---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/README             |  45 +++
 src/backend/executor/execAmi.c          |   5 +
 src/backend/executor/execAsync.c        | 520 ++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c     |   1 +
 src/backend/executor/instrument.c       |   2 +-
 src/backend/executor/nodeAppend.c       | 169 ++++++++++-
 src/backend/executor/nodeForeignscan.c  |  49 +++
 src/backend/nodes/copyfuncs.c           |   2 +
 src/backend/nodes/outfuncs.c            |   2 +
 src/backend/nodes/readfuncs.c           |   2 +
 src/backend/optimizer/plan/createplan.c |  66 +++-
 src/backend/postmaster/pgstat.c         |   2 +
 src/backend/utils/adt/ruleutils.c       |   6 +-
 src/include/executor/execAsync.h        |  30 ++
 src/include/executor/nodeAppend.h       |   3 +
 src/include/executor/nodeForeignscan.h  |   7 +
 src/include/foreign/fdwapi.h            |  17 ++
 src/include/nodes/execnodes.h           |  65 +++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 21 files changed, 971 insertions(+), 29 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/README b/src/backend/executor/README
index a004506..e6caeb7 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -349,3 +349,48 @@ query returning the same set of scan tuples multiple times.  Likewise,
 SRFs are disallowed in an UPDATE's targetlist.  There, they would have the
 effect of the same row being updated multiple times, which is not very
 useful --- and updates after the first would have no effect anyway.
+
+Asynchronous Execution
+----------------------
+
+In certain cases, it's desirable for a node to indicate that it cannot
+return any tuple immediately but may be able to do so at a later time.  This
+might be either because the node is waiting on an event external to the
+database system, such as a ForeignScan awaiting network I/O, or because
+the node is waiting for an event internal to the database system - e.g.
+one process involved in a parallel query may find that it cannot progress
+a certain parallel operation until some other process reaches a certain
+point in the computation.  A process which discovers this type of situation
+can always handle it simply by blocking, but this may waste time that could
+be spent executing some other part of the plan where progress could be
+made immediately.  This is particularly likely to occur when the plan
+contains an Append node.
+
+To use asynchronous execution, a node must first request a tuple from
+an async-capable child node using ExecAsyncRequest.  Next, when the
+result is not available immediately, it must execute the asynchronous
+event loop using ExecAsyncEventLoop; it can avoid giving up control
+indefinitely by passing a timeout to this function, or pass a timeout
+of 0 to poll for events without blocking.  Eventually, when a node to which an
+asynchronous request has been made produces a tuple, the requesting
+node will receive a callback from the event loop via
+ExecAsyncResponse. Typically, the ExecAsyncResponse callback is the
+only one required for nodes that wish to request tuples
+asynchronously.
+
+On the other hand, nodes that wish to produce tuples asynchronously
+generally need to implement three methods:
+
+1. When an asynchronous request is made, the node's ExecAsyncRequest callback
+will be invoked; it should use ExecAsyncSetRequiredEvents to indicate the
+number of file descriptor events for which it wishes to wait and whether it
+wishes to receive a callback when the process latch is set. Alternatively,
+it can use ExecAsyncRequestDone if a result is available immediately.
+
+2. When the event loop wishes to wait or poll for file descriptor events and
+the process latch, the ExecAsyncConfigureWait callback is invoked to configure
+the file descriptor wait events for which the node wishes to wait.  This
+callback isn't needed if the node only cares about the process latch.
+
+3. When file descriptors or the process latch become ready, the node's
+ExecAsyncNotify callback is invoked.
diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 7337d21..4c1991c 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -479,11 +479,16 @@ ExecSupportsBackwardScan(Plan *node)
 			{
 				ListCell   *l;
 
+				/* With async, tuples may be interleaved, so can't back up. */
+				if (((Append *) node)->nasyncplans != 0)
+					return false;
+
 				foreach(l, ((Append *) node)->appendplans)
 				{
 					if (!ExecSupportsBackwardScan((Plan *) lfirst(l)))
 						return false;
 				}
+
 				/* need not check tlist because Append doesn't evaluate it */
 				return true;
 			}
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..115b147
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,520 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/latch.h"
+#include "utils/memutils.h"
+
+static bool ExecAsyncEventWait(EState *estate, long timeout);
+static bool ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+	bool reinit);
+static void ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq);
+static void ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq);
+
+#define EVENT_BUFFER_SIZE	16
+
+/*
+ * Asynchronously request a tuple from a designated async-aware node.
+ *
+ * requestor is the node that wants the tuple; requestee is the node from
+ * which it wants the tuple.  request_index is an arbitrary integer specified
+ * by the requestor which will be available at the time the requestor receives
+ * the tuple.  This is useful if the requestor has multiple children and
+ * needs an easy way to figure out which one is delivering a tuple.
+ */
+void
+ExecAsyncRequest(EState *estate, PlanState *requestor, int request_index,
+				 PlanState *requestee)
+{
+	PendingAsyncRequest *areq = NULL;
+	int		nasync = estate->es_num_pending_async;
+
+	if (requestee->instrument)
+		InstrStartNode(requestee->instrument);
+
+	/*
+	 * If the number of pending asynchronous nodes exceeds the number of
+	 * available slots in the es_pending_async array, expand the array.
+	 * We start with 16 slots, and thereafter double the array size each
+	 * time we run out of slots.
+	 */
+	if (nasync >= estate->es_max_pending_async)
+	{
+		int	newmax;
+
+		newmax = estate->es_max_pending_async * 2;
+		if (estate->es_max_pending_async == 0)
+		{
+			newmax = 16;
+			estate->es_pending_async =
+				MemoryContextAllocZero(estate->es_query_cxt,
+								   newmax * sizeof(PendingAsyncRequest *));
+		}
+		else
+		{
+			int	newentries = newmax - estate->es_max_pending_async;
+
+			estate->es_pending_async =
+				repalloc(estate->es_pending_async,
+						 newmax * sizeof(PendingAsyncRequest *));
+			MemSet(&estate->es_pending_async[estate->es_max_pending_async],
+				   0, newentries * sizeof(PendingAsyncRequest *));
+		}
+		estate->es_max_pending_async = newmax;
+	}
+
+	/*
+	 * To avoid unnecessary palloc traffic, we reuse a previously-allocated
+	 * PendingAsyncRequest if there is one.  If not, we must allocate a new
+	 * one.
+	 */
+	if (estate->es_pending_async[nasync] == NULL)
+	{
+		areq = MemoryContextAllocZero(estate->es_query_cxt,
+									  sizeof(PendingAsyncRequest));
+		estate->es_pending_async[nasync] = areq;
+	}
+	else
+	{
+		areq = estate->es_pending_async[nasync];
+		MemSet(areq, 0, sizeof(PendingAsyncRequest));
+	}
+	areq->myindex = estate->es_num_pending_async;
+
+	/* Initialize the new request. */
+	areq->state = ASYNCREQ_IDLE;
+	areq->requestor = requestor;
+	areq->request_index = request_index;
+	areq->requestee = requestee;
+
+	/* Give the requestee a chance to do whatever it wants. */
+	switch (nodeTag(requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanRequest(estate, areq);
+			break;
+		default:
+			/* If requestee doesn't support async, caller messed up. */
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(requestee));
+	}
+
+	if (areq->requestee->instrument)
+		InstrStopNode(requestee->instrument, 0);
+
+	/* Make the request pending; the event loop delivers any result. */
+	estate->es_num_pending_async++;
+
+	return;
+}
+
+/*
+ * Execute the main loop until the timeout expires or a result is delivered
+ * to the requestor.
+ *
+ * If the timeout is -1, there is no timeout; wait indefinitely until a
+ * result is ready for requestor.  If the timeout is 0, do not block, but
+ * poll for events and fire callbacks for as long as we can do so without
+ * blocking.  If timeout is greater than 0, block for at most the number
+ * of milliseconds indicated by the timeout.
+ *
+ * Returns true if a result was delivered to the requestor.  A return value
+ * of false indicates that the timeout was reached without delivering a
+ * result to the requestor.
+ */
+bool
+ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
+{
+	instr_time start_time;
+	long cur_timeout = timeout;
+	bool	requestor_done = false;
+
+	Assert(requestor != NULL);
+
+	/*
+	 * If we plan to wait - but not indefinitely - we need to record the
+	 * current time.
+	 */
+	if (timeout > 0)
+		INSTR_TIME_SET_CURRENT(start_time);
+
+	/* Main event loop: poll for events, deliver notifications. */
+	Assert(estate->es_async_callback_pending == 0);
+	for (;;)
+	{
+		int		i;
+		bool	any_node_done = false;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Check for events only if some request is not yet complete. */
+		if (estate->es_num_async_ready < estate->es_num_pending_async)
+		{
+			/* Don't block if a tuple is already available. */
+			if (estate->es_async_callback_pending > 0)
+				ExecAsyncEventWait(estate, 0);
+			else if (!ExecAsyncEventWait(estate, cur_timeout))
+			{
+				/* No event fired; recalculate the remaining time. */
+				instr_time      cur_time;
+
+				/* If waiting forever, just retry. */
+				if (timeout < 0)
+					continue;
+
+				INSTR_TIME_SET_CURRENT(cur_time);
+				INSTR_TIME_SUBTRACT(cur_time, start_time);
+				cur_timeout =
+					timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
+
+				if (cur_timeout > 0)
+					continue;
+				cur_timeout = 0;	/* expired; let the exit test see it */
+			}
+		}
+
+		/* Deliver notifications. */
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->requestee->instrument)
+				InstrStartNode(areq->requestee->instrument);
+
+			/* Notify if the requestee is ready */
+			if (areq->state == ASYNCREQ_CALLBACK_PENDING)
+				ExecAsyncNotify(estate, areq);
+
+			/* Deliver the acquired tuple to the requestor */
+			if (areq->state == ASYNCREQ_COMPLETE)
+			{
+				any_node_done = true;
+				if (requestor == areq->requestor)
+					requestor_done = true;
+				ExecAsyncResponse(estate, areq);
+
+				if (areq->requestee->instrument)
+					InstrStopNode(areq->requestee->instrument,
+								  TupIsNull((TupleTableSlot*)areq->result) ?
+								  0.0 : 1.0);
+			}
+			else if (areq->requestee->instrument)
+				InstrStopNode(areq->requestee->instrument, 0);
+		}
+
+		/* If any node completed, compact the array. */
+		if (any_node_done)
+		{
+			int		hidx = 0,
+					tidx;
+
+			/*
+			 * Swap all not-yet-completed items to the start of the array.
+			 * Keep them in the same order.
+			 */
+			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			{
+				PendingAsyncRequest *head;
+				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+
+				Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+
+				if (tail->state == ASYNCREQ_COMPLETE)
+					continue;
+				head = estate->es_pending_async[hidx];
+				estate->es_pending_async[tidx] = head;
+				estate->es_pending_async[hidx] = tail;
+				++hidx;
+			}
+			estate->es_num_pending_async = hidx;
+		}
+
+		/*
+		 * We only consider exiting the loop when no notifications are
+		 * pending.  Otherwise, each call to this function might advance
+		 * the computation by only a very small amount; to the contrary,
+		 * we want to push it forward as far as possible.
+		 */
+		if (estate->es_async_callback_pending == 0)
+		{
+			/* If requestor is ready, exit. */
+			if (requestor_done)
+				return true;
+			/* If timeout was 0 or has expired, exit. */
+			if (cur_timeout == 0)
+				return false;
+		}
+	}
+}
+
+/*
+ * Wait or poll for events.  As with ExecAsyncEventLoop, a timeout of -1
+ * means wait forever, 0 means don't wait at all, and >0 means wait for the
+ * indicated number of milliseconds.
+ *
+ * Returns false if we timed out, or true if any event fired or there was
+ * no event to wait for.
+ */
+static bool
+ExecAsyncEventWait(EState *estate, long timeout)
+{
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int		noccurred;
+	int		i;
+	int		n;
+	bool	reinit = false;
+	bool	process_latch_set = false;
+	bool	added = false;
+	bool	fired = false;
+
+	if (estate->es_wait_event_set == NULL)
+	{
+		/*
+		 * Allow for a few extra events without reinitializing.  It
+		 * doesn't seem worth the complexity of doing anything very
+		 * aggressive here, because plans that depend on massive numbers
+		 * of external FDs are likely to run afoul of kernel limits anyway.
+		 */
+		estate->es_allocated_fd_events = estate->es_total_fd_events + 16;
+
+		/*
+		 * The wait event set created here must outlive the ExecutorState
+		 * context, but must still be released in case of error.
+		 */
+		estate->es_wait_event_set =
+			CreateWaitEventSet(TopTransactionContext,
+							   TopTransactionResourceOwner,
+							   estate->es_allocated_fd_events + 1);
+
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+		reinit = true;
+	}
+
+	/* Give each waiting node a chance to add or modify events. */
+	for (i = 0; i < estate->es_num_pending_async; ++i)
+	{
+		PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+		if (areq->num_fd_events > 0 || areq->wants_process_latch)
+			added |= ExecAsyncConfigureWait(estate, areq, reinit);
+	}
+
+	/*
+	 * We may have no events to wait for.  This occurs when all nodes that
+	 * are executing asynchronously have tuples immediately available.
+	 */
+	if (!added)
+		return true;
+
+	/* Wait for at least one event to occur. */
+	noccurred = WaitEventSetWait(estate->es_wait_event_set, timeout,
+								 occurred_event, EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+
+	if (noccurred == 0)
+		return false;
+
+	/*
+	 * Loop over the occurred events and set the callback_pending flags
+	 * for the appropriate requests.  The waiting nodes should have
+	 * registered their wait events with user_data pointing back to the
+	 * PendingAsyncRequest, but the process latch needs special handling.
+	 */
+	for (n = 0; n < noccurred; ++n)
+	{
+		WaitEvent  *w = &occurred_event[n];
+
+		if ((w->events & WL_LATCH_SET) != 0)
+		{
+			process_latch_set = true;
+			continue;
+		}
+
+		if ((w->events & (WL_SOCKET_READABLE|WL_SOCKET_WRITEABLE)) != 0)
+		{
+			PendingAsyncRequest *areq = w->user_data;
+
+			Assert(areq->state == ASYNCREQ_WAITING);
+
+			areq->state = ASYNCREQ_CALLBACK_PENDING;
+			estate->es_async_callback_pending++;
+			fired = true;
+		}
+	}
+
+	/*
+	 * If the process latch got set, we must schedule a callback for every
+	 * requestee that cares about it.
+	 */
+	if (process_latch_set)
+	{
+		for (i = 0; i < estate->es_num_pending_async; ++i)
+		{
+			PendingAsyncRequest *areq = estate->es_pending_async[i];
+
+			if (areq->wants_process_latch)
+			{
+				Assert(areq->state == ASYNCREQ_WAITING);
+				areq->state = ASYNCREQ_CALLBACK_PENDING;
+				estate->es_async_callback_pending++;
+				fired = true;
+			}
+		}
+	}
+
+	return fired;
+}
+
+/*
+ * Give the asynchronous node a chance to configure the file descriptor
+ * events for which it wishes to wait.  We expect the node-type specific
+ * callback to make one or more calls of the following form:
+ *
+ * AddWaitEventToSet(es->es_wait_event_set, events, fd, NULL, areq);
+ *
+ * The events should include only WL_SOCKET_READABLE or WL_SOCKET_WRITEABLE,
+ * and the number of calls should not exceed areq->num_fd_events (as
+ * previously set via ExecAsyncSetRequiredEvents).
+ *
+ * An individual request may omit registering an event, but it is the
+ * node driver's responsibility to register at least one event per
+ * requestor.
+ */
+static bool
+ExecAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+					   bool reinit)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			return ExecAsyncForeignScanConfigureWait(estate, areq, reinit);
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+	return false;				/* keep compiler quiet */
+}
+
+/*
+ * Call the asynchronous node back when a relevant event has occurred.
+ */
+static void
+ExecAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestee))
+	{
+		case T_ForeignScanState:
+			ExecAsyncForeignScanNotify(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestee));
+	}
+
+	estate->es_async_callback_pending--;
+}
+
+/*
+ * Call the requestor back when an asynchronous node has produced a result.
+ */
+static void
+ExecAsyncResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	switch (nodeTag(areq->requestor))
+	{
+		case T_AppendState:
+			ExecAsyncAppendResponse(estate, areq);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(areq->requestor));
+	}
+	estate->es_num_async_ready--;
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more file descriptor events that can be registered on a
+ * WaitEventSet, and possibly also on the process latch.  num_fd_events is the
+ * maximum number of file descriptor events that it will wish to register.
+ * force_reset should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncSetRequiredEvents(EState *estate, PendingAsyncRequest *areq,
+	int num_fd_events, bool wants_process_latch,
+	bool force_reset)
+{
+	estate->es_total_fd_events += num_fd_events - areq->num_fd_events;
+	areq->num_fd_events = num_fd_events;
+	areq->wants_process_latch = wants_process_latch;
+	areq->state = ASYNCREQ_WAITING;
+
+	if (force_reset && estate->es_wait_event_set != NULL)
+		ExecAsyncClearEvents(estate);
+}
+
+/*
+ * An async-capable node should call this function to deliver the tuple to
+ * the node which requested it.  The node can call this from its
+ * ExecAsyncRequest callback if the requested tuple is available immediately,
+ * or at a later time from its ExecAsyncNotify callback.
+ */
+void
+ExecAsyncRequestDone(EState *estate, PendingAsyncRequest *areq, Node *result)
+{
+	/*
+	 * Since the request is complete, the requestee is no longer allowed
+	 * to wait for any events.  Note that this forces a rebuild of
+	 * es_wait_event_set every time a node that was previously waiting
+	 * stops doing so.  It might be possible to defer that decision until
+	 * we actually wait again, because it's quite possible that a new
+	 * request will be made of the same node before any wait actually
+	 * happens.  However, we have to balance the cost of rebuilding the
+	 * WaitEventSet against the additional overhead of tracking which nodes
+	 * need a callback to remove registered wait events.  It's not clear
+	 * that we would come out ahead, so use brute force for now.
+	 */
+	Assert(areq->state == ASYNCREQ_IDLE ||
+		   areq->state == ASYNCREQ_CALLBACK_PENDING);
+
+	if (areq->num_fd_events > 0 || areq->wants_process_latch)
+		ExecAsyncSetRequiredEvents(estate, areq, 0, false, true);
+
+
+	/* Save result and mark request as complete. */
+	areq->result = result;
+	areq->state = ASYNCREQ_COMPLETE;
+	estate->es_num_async_ready++;
+}
+
+
+/* Clear async events */
+void
+ExecAsyncClearEvents(EState *estate)
+{
+	if (estate->es_wait_event_set == NULL)
+		return;
+
+	FreeWaitEventSet(estate->es_wait_event_set);
+	estate->es_wait_event_set = NULL;
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 294ad2c..8f8ad2c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6ec96ec..959ee90 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -102,7 +102,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 							 &pgBufferUsage, &instr->bufusage_start);
 
 	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
+	if (!instr->running && nTuples > 0)
 	{
 		instr->running = true;
 		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index aae5e3f..2c07095 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -79,16 +80,21 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	/*
+	 * This routine is only responsible for setting up for nodes being scanned
+	 * synchronously, so the first node we can scan is given by nasyncplans
+	 * and the last is given by as_nplans - 1.
+	 */
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -96,7 +102,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -148,6 +154,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* initially, every async subplan needs a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -182,9 +197,9 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ps_ProjInfo = NULL;
 
 	/*
-	 * initialize to scan first subplan
+	 * initialize to scan first synchronous subplan
 	 */
-	appendstate->as_whichplan = 0;
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -199,15 +214,85 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
+	if (node->as_nasyncplans > 0)
+	{
+		EState *estate = node->ps.state;
+		int	i;
+
+		/*
+		 * If there are any asynchronously-generated results that have
+		 * not yet been returned, return one of them.
+		 */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+
+		/*
+		 * XXXX: Always clear registered events.  This seems a bit
+		 * inefficient, but the set of events to wait for changes almost
+		 * at random on every call.
+		 */
+		ExecAsyncClearEvents(estate);
+
+		while ((i = bms_first_member(node->as_needrequest)) >= 0)
+		{
+			node->as_nasyncpending++;
+			ExecAsyncRequest(estate, &node->ps, i, node->appendplans[i]);
+		}
+
+		if (node->as_nasyncpending == 0 && node->as_syncdone)
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+	}
+
 	for (;;)
 	{
 		PlanState  *subnode;
 		TupleTableSlot *result;
 
 		/*
-		 * figure out which subplan we are currently processing
+		 * if we have async requests outstanding, run the event loop
+		 */
+		if (node->as_nasyncpending > 0)
+		{
+			long	timeout = node->as_syncdone ? -1 : 0;
+
+			while (node->as_nasyncpending > 0)
+			{
+				if (ExecAsyncEventLoop(node->ps.state, &node->ps, timeout) &&
+					node->as_nasyncresult > 0)
+				{
+					/* Asynchronous subplan returned a tuple! */
+					--node->as_nasyncresult;
+					return node->as_asyncresult[node->as_nasyncresult];
+				}
+
+				/* Timeout reached; fall through to the sync nodes, if any */
+				if (!node->as_syncdone)
+					break;
+			}
+
+			/*
+			 * If there is no asynchronous activity still pending and the
+			 * synchronous activity is also complete, we're totally done
+			 * scanning this node.  Otherwise, we're done with the
+			 * asynchronous stuff but must continue scanning the synchronous
+			 * children.
+			 */
+			if (node->as_syncdone)
+			{
+				Assert(node->as_nasyncpending == 0);
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/*
+		 * figure out which synchronous subplan we are currently processing
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(!node->as_syncdone);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -227,14 +312,21 @@ ExecAppend(AppendState *node)
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			node->as_syncdone = true;
+			if (node->as_nasyncpending == 0)
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
@@ -273,6 +365,16 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/*
+	 * XXX. Cancel outstanding asynchronous tuple requests here! (How?)
+	 */
+
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -291,6 +393,47 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncAppendResponse
+ *
+ *		Receive a response from an asynchronous request we made.
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncAppendResponse(EState *estate, PendingAsyncRequest *areq)
+{
+	AppendState *node = (AppendState *) areq->requestor;
+	TupleTableSlot *slot;
+
+	/* We shouldn't be called until the request is complete. */
+	Assert(areq->state == ASYNCREQ_COMPLETE);
+
+	/* Our result slot shouldn't already be occupied. */
+	Assert(TupIsNull(node->ps.ps_ResultTupleSlot));
+
+	/* Result should be a TupleTableSlot or NULL. */
+	slot = (TupleTableSlot *) areq->result;
+	Assert(slot == NULL || IsA(slot, TupleTableSlot));
+
+	/* This is no longer pending */
+	--node->as_nasyncpending;
+
+	/* If the result is NULL or an empty slot, there's nothing more to do. */
+	if (TupIsNull(slot))
+		return;
+
+	/* Save result so we can return it. */
+	Assert(node->as_nasyncresult < node->as_nasyncplans);
+	node->as_asyncresult[node->as_nasyncresult++] = slot;
+
+	/*
+	 * Mark the node that returned a result as ready for a new request.  We
+	 * don't launch a new request here; ExecAppend does that on its next call.
+	 */
+	node->as_needrequest =
+		bms_add_member(node->as_needrequest, areq->request_index);
+}
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 9cde112..1df8ccb 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -364,3 +364,52 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanRequest
+ *
+ *		Initiate an asynchronous request
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncRequest != NULL);
+	fdwroutine->ForeignAsyncRequest(estate, areq);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(estate, areq, reinit);
+}
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanNotify
+ *
+ *		Event loop callback
+ * ----------------------------------------------------------------
+ */
+void
+ExecAsyncForeignScanNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncNotify != NULL);
+	fdwroutine->ForeignAsyncNotify(estate, areq);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 45a04b0..929dfea 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -242,6 +242,8 @@ _copyAppend(const Append *from)
 	 */
 	COPY_NODE_FIELD(partitioned_rels);
 	COPY_NODE_FIELD(appendplans);
+	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 379d92a..823725b 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -394,6 +394,8 @@ _outAppend(StringInfo str, const Append *node)
 
 	WRITE_NODE_FIELD(partitioned_rels);
 	WRITE_NODE_FIELD(appendplans);
+	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 86c811d..5568288 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1594,6 +1594,8 @@ _readAppend(void)
 
 	READ_NODE_FIELD(partitioned_rels);
 	READ_NODE_FIELD(appendplans);
+	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 5c934f2..a339575 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *appendplans, int nasyncplans,
+						   int referent, List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -282,7 +283,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
-
+static bool is_async_capable_path(Path *path);
 
 /*
  * create_plan
@@ -1003,8 +1004,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1030,7 +1035,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		return plan;
 	}
 
-	/* Build the plan for each child */
+	/*
+	 * Build the plan for each child.
+	 *
+	 * The first child in an inheritance set is the representative used when
+	 * deparsing tlist entries (see set_deparse_planstate), so we must keep
+	 * the first child in best_path->subpaths at the head of the subplan
+	 * list.
+	 */
 	foreach(subpaths, best_path->subpaths)
 	{
 		Path	   *subpath = (Path *) lfirst(subpaths);
@@ -1039,7 +1051,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1049,7 +1072,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist, best_path->partitioned_rels);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5270,7 +5295,8 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+			List *tlist, List *partitioned_rels)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5281,6 +5307,8 @@ make_append(List *appendplans, List *tlist, List *partitioned_rels)
 	plan->righttree = NULL;
 	node->partitioned_rels = partitioned_rels;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6613,3 +6641,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index a0b0eec..af288be 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3611,6 +3611,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 			break;
 		case WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE:
 			event_name = "LogicalSyncStateChange";
+			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
 			break;
 			/* no default case, so that compiler will warn */
 	}
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index d83377d..1c80e85 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4432,7 +4432,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		int idx = ((Append*)(((AppendState *) ps)->ps.plan))->referent;
+		dpns->outer_planstate =
+			((AppendState *) ps)->appendplans[idx];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..9e7845c
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,30 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncRequest(EState *estate, PlanState *requestor,
+		int request_index, PlanState *requestee);
+extern bool ExecAsyncEventLoop(EState *estate, PlanState *requestor,
+				long timeout);
+
+extern void ExecAsyncSetRequiredEvents(EState *estate,
+	PendingAsyncRequest *areq, int num_fd_events,
+	bool wants_process_latch, bool force_reset);
+extern void ExecAsyncRequestDone(EState *estate,
+	PendingAsyncRequest *areq, Node *result);
+extern void ExecAsyncClearEvents(EState *estate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index ee0b6ad..d8c3e31 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -21,4 +21,7 @@ extern TupleTableSlot *ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
+extern void ExecAsyncAppendResponse(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif							/* NODEAPPEND_H */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 3ff4ecd..e6ba392 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,4 +30,11 @@ extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
 
+extern void ExecAsyncForeignScanRequest(EState *estate,
+	PendingAsyncRequest *areq);
+extern bool ExecAsyncForeignScanConfigureWait(EState *estate,
+	PendingAsyncRequest *areq, bool reinit);
+extern void ExecAsyncForeignScanNotify(EState *estate,
+	PendingAsyncRequest *areq);
+
 #endif							/* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e391f20..57876d1 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -156,6 +156,15 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 													RelOptInfo *rel,
 													RangeTblEntry *rte);
 
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef void (*ForeignAsyncRequest_function) (EState *estate,
+											PendingAsyncRequest *areq);
+typedef bool (*ForeignAsyncConfigureWait_function) (EState *estate,
+											PendingAsyncRequest *areq,
+											bool reinit);
+typedef void (*ForeignAsyncNotify_function) (EState *estate,
+											PendingAsyncRequest *areq);
+
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
  * function.  It provides pointers to the callback functions needed by the
@@ -225,6 +234,13 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncRequest_function ForeignAsyncRequest;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+	ForeignAsyncNotify_function ForeignAsyncNotify;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 85fac8a..48c7c2f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -415,6 +415,32 @@ typedef struct ResultRelInfo
 } ResultRelInfo;
 
 /* ----------------
+ *	  PendingAsyncRequest
+ *
+ * State for an asynchronous tuple request.
+ * ----------------
+ */
+typedef enum AsyncRequestState
+{
+	ASYNCREQ_IDLE,						/* Nothing is requested */
+	ASYNCREQ_WAITING,					/* Waiting for events */
+	ASYNCREQ_CALLBACK_PENDING,			/* Having events to be processed */
+	ASYNCREQ_COMPLETE					/* Result is available */
+} AsyncRequestState;
+
+typedef struct PendingAsyncRequest
+{
+	int			myindex;			/* Index in es_pending_async. */
+	struct PlanState *requestor;	/* Node that wants a tuple. */
+	struct PlanState *requestee;	/* Node from which a tuple is wanted. */
+	int			request_index;	/* Scratch space for requestor. */
+	int			num_fd_events;	/* Max number of FD events requestee needs. */
+	bool		wants_process_latch;	/* Requestee cares about MyLatch. */
+	AsyncRequestState state;
+	Node	   *result;			/* Result (NULL if no more tuples). */
+} PendingAsyncRequest;
+
+/* ----------------
  *	  EState information
  *
  * Master working state for an Executor invocation
@@ -506,6 +532,32 @@ typedef struct EState
 
 	/* The per-query shared memory area to use for parallel execution. */
 	struct dsa_area *es_query_dsa;
+
+	/*
+	 * Support for asynchronous execution.
+	 *
+	 * es_max_pending_async is the allocated size of es_pending_async, and
+	 * es_num_pending_async is the number of entries that are currently valid.
+	 * (Entries after that may point to storage that can be reused.)
+	 * es_num_async_ready is the number of PendingAsyncRequests that have a
+	 * tuple ready to be retrieved.
+	 *
+	 * es_total_fd_events is the total number of FD events needed by all
+	 * pending async nodes, and es_allocated_fd_events is the number any
+	 * current wait event set was allocated to handle.  es_wait_event_set, if
+	 * non-NULL, is a previously allocated event set that may be reusable by a
+	 * future wait provided that nothing's been removed and not too many more
+	 * events have been added.
+	 */
+	int			es_num_pending_async;		/* # of nodes to wait */
+	int			es_max_pending_async;		/* max # of pending nodes */
+	int			es_async_callback_pending;	/* # of nodes to callback */
+	int			es_num_async_ready;			/* # of tuple-ready nodes */
+	PendingAsyncRequest **es_pending_async;
+
+	int			es_total_fd_events;
+	int			es_allocated_fd_events;
+	struct WaitEventSet *es_wait_event_set;
 } EState;
 
 
@@ -971,17 +1023,20 @@ typedef struct ModifyTableState
 
 /* ----------------
  *	 AppendState information
- *
- *		nplans			how many plans are in the array
- *		whichplan		which plan is being executed (0 .. n-1)
  * ----------------
  */
 typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
-	int			as_nplans;
-	int			as_whichplan;
+	int			as_nplans;		/* total # of children */
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
+	int			as_nasyncpending;	/* # of outstanding async requests */
 } AppendState;
 
 /* ----------------
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f1a1b24..5abff26 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6bffe63..fb6d02a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -812,7 +812,8 @@ typedef enum
 	WAIT_EVENT_SAFE_SNAPSHOT,
 	WAIT_EVENT_SYNC_REP,
 	WAIT_EVENT_LOGICAL_SYNC_DATA,
-	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE
+	WAIT_EVENT_LOGICAL_SYNC_STATE_CHANGE,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

0003-Make-postgres_fdw-async-capable.patchtext/x-patch; charset=us-asciiDownload
From 117e3f2e0f17985af510bce9ab28a9c50f9e0b72 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 23 Feb 2017 15:04:46 +0900
Subject: [PATCH 3/5] Make postgres_fdw async-capable.

Make postgres_fdw async-capable using the new infrastructure.
Additionally, this gives each postgres_fdw connection a
connection-specific storage area so that foreign scans on the same
connection can share some data; postgres_fdw uses it to track the scan
node currently running on the connection. This allows asynchronous
execution of multiple foreign scans on one foreign server.
---
 contrib/postgres_fdw/connection.c              |  64 ++-
 contrib/postgres_fdw/expected/postgres_fdw.out | 144 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 522 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  12 +-
 5 files changed, 588 insertions(+), 156 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index be4ec07..6247dc8 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
 	bool		invalidated;	/* true if reconnect is pending */
 	uint32		server_hashvalue;	/* hash value of foreign server OID */
 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -73,6 +74,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void disconnect_pg_server(ConnCacheEntry *entry);
 static void check_conn_params(const char **keywords, const char **values);
@@ -94,17 +96,11 @@ static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
 
 
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
@@ -136,11 +132,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 									  pgfdw_inval_callback, (Datum) 0);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -158,6 +151,29 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	/* Reject further use of connections which failed abort cleanup. */
 	pgfdw_reject_incomplete_xact_state_change(entry);
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+
 	/*
 	 * If the connection needs to be remade due to invalidation, disconnect as
 	 * soon as we're out of all transactions.
@@ -196,6 +212,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->mapping_hashvalue =
 			GetSysCacheHashValue1(USERMAPPINGOID,
 								  ObjectIdGetDatum(user->umid));
+		entry->storage = NULL;
 
 		/* Now try to make the connection */
 		entry->conn = connect_pg_server(server, user);
@@ -216,6 +233,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocates
+ * initsize bytes of zeroed storage if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c19b331..9d7eb9b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6515,12 +6515,12 @@ INSERT INTO b(aa) VALUES('bbbbb');
 SELECT tableoid::regclass, * FROM a;
  tableoid |  aa   
 ----------+-------
- a        | aaa
- a        | aaaa
- a        | aaaaa
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | aaaa
+ a        | aaaaa
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6543,12 +6543,12 @@ UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | bbb
  b        | bbbb
  b        | bbbbb
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6571,12 +6571,12 @@ UPDATE b SET aa = 'new';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | aaa
- a        | zzzzzz
- a        | zzzzzz
  b        | new
  b        | new
  b        | new
+ a        | aaa
+ a        | zzzzzz
+ a        | zzzzzz
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6599,12 +6599,12 @@ UPDATE a SET aa = 'newtoo';
 SELECT tableoid::regclass, * FROM a;
  tableoid |   aa   
 ----------+--------
- a        | newtoo
- a        | newtoo
- a        | newtoo
  b        | newtoo
  b        | newtoo
  b        | newtoo
+ a        | newtoo
+ a        | newtoo
+ a        | newtoo
 (6 rows)
 
 SELECT tableoid::regclass, * FROM b;
@@ -6662,35 +6662,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6700,35 +6705,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6758,11 +6768,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6776,11 +6786,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6811,16 +6821,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6838,16 +6848,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6998,27 +7008,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index d77c2a7..01b2398 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection*/
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+								 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -348,6 +380,14 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static void postgresForeignAsyncRequest(EState *estate,
+							PendingAsyncRequest *areq);
+static bool postgresForeignAsyncConfigureWait(EState *estate,
+							PendingAsyncRequest *areq,
+							bool reinit);
+static void postgresForeignAsyncNotify(EState *estate,
+						   PendingAsyncRequest *areq);
 
 /*
  * Helper functions
@@ -368,7 +408,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +481,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +516,12 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncRequest = postgresForeignAsyncRequest;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+	routine->ForeignAsyncNotify = postgresForeignAsyncNotify;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1322,12 +1372,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1383,32 +1442,130 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (fsstate->run_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting behind this node on the same connection,
+			 * let the first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself on a
+				 * node that no one is waiting for.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Someone else is holding this connection and we want this node
+			 * to run later. Add myself to the tail of the waiters' list, then
+			 * return not-ready.  To avoid scanning through the waiters' list,
+			 * the current owner maintains a shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node in the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this time no node is running on the connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node in the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_owner_state->run_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1699,7 +1875,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1778,6 +1956,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1788,14 +1968,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1803,10 +1983,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1844,6 +2024,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1864,14 +2046,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1879,10 +2061,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1920,6 +2102,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1940,14 +2124,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1955,10 +2139,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2005,16 +2189,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2302,7 +2486,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2355,7 +2541,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2402,8 +2591,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2522,6 +2711,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2564,6 +2754,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2918,11 +3118,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2988,47 +3188,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows for the FETCH previously sent on the node's connection.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3038,27 +3287,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3142,7 +3446,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3152,12 +3456,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3165,9 +3469,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3298,9 +3602,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3308,10 +3612,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4582,6 +4886,80 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * Accept an async request. Notify the caller if the next tuple is immediately
+ * available. ExecForeignScan does additional work to finish the returned
+ * tuple, so call it instead of postgresIterateForeignScan to acquire a tuple
+ * in the expected shape.
+ */
+static void
+postgresForeignAsyncRequest(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	GetPgFdwScanState(node)->run_async = true;
+	slot = ExecForeignScan(node);
+	if (GetPgFdwScanState(node)->result_ready)
+		ExecAsyncRequestDone(estate, areq, (Node *) slot);
+	else
+		ExecAsyncSetRequiredEvents(estate, areq, 1, false, false);
+}
+
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when the node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(EState *estate, PendingAsyncRequest *areq,
+						   bool reinit)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	/* If the caller didn't reinit, this event is already in event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(estate->es_wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, areq);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Process a notification from the async mechanism. ExecForeignScan does
+ * additional work to complete the returned tuple, so call it instead of
+ * postgresIterateForeignScan to acquire a completed tuple.
+ */
+static void
+postgresForeignAsyncNotify(EState *estate, PendingAsyncRequest *areq)
+{
+	ForeignScanState *node = (ForeignScanState *) areq->requestee;
+	TupleTableSlot *slot;
+
+	Assert(IsA(node, ForeignScanState));
+	slot = ExecForeignScan(node);
+	Assert(GetPgFdwScanState(node)->result_ready);
+
+	ExecAsyncRequestDone(estate, areq, (Node *) slot);
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4946,7 +5324,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..41ac1d2 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 5f65d9d..340a376 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1653,8 +1653,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2

0004-Apply-unlikely-to-suggest-synchronous-route-of-ExecA.patch (text/x-patch; charset=us-ascii)
From 1fbebf72e4aa57bbb4d19616eabfe888c4063e29 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Oct 2016 18:05:30 +0900
Subject: [PATCH 4/5] Apply unlikely to suggest synchronous route of
 ExecAppend.

ExecAppend seems to be slowed down by the penalty of mispredicted
branches related to async execution. Apply unlikely() to them to
prevent such a penalty on the existing route. Asynchronous execution
already adds a lot of code, so this doesn't introduce significant
degradation.
---
 src/backend/executor/nodeAppend.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 2c07095..43e777f 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -214,7 +214,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	if (node->as_nasyncplans > 0)
+	if (unlikely(node->as_nasyncplans > 0))
 	{
 		EState *estate = node->ps.state;
 		int	i;
@@ -255,7 +255,7 @@ ExecAppend(AppendState *node)
 		/*
 		 * if we have async requests outstanding, run the event loop
 		 */
-		if (node->as_nasyncpending > 0)
+		if (unlikely(node->as_nasyncpending > 0))
 		{
 			long	timeout = node->as_syncdone ? -1 : 0;
 
-- 
2.9.2

0005-Refactor-ExecAsyncEventLoop.patch (text/x-patch; charset=us-ascii)
From 0e811309902e99c40159f0984d1e1dccfd419861 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 25 Jul 2017 17:19:10 +0900
Subject: [PATCH 5/5] Refactor ExecAsyncEventLoop

The compaction loop in ExecAsyncEventLoop was written in a somewhat
tricky way. This patch rewrites it in a more straightforward way. Maybe.
---
 src/backend/executor/execAsync.c | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 115b147..173ee39 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -222,28 +222,38 @@ ExecAsyncEventLoop(EState *estate, PlanState *requestor, long timeout)
 		/* If any node completed, compact the array. */
 		if (any_node_done)
 		{
-			int		hidx = 0,
-					tidx;
+			int		i = 0;
+			int		npending = 0;
 
 			/*
 			 * Swap all non-yet-completed items to the start of the array.
 			 * Keep them in the same order.
 			 */
-			for (tidx = 0; tidx < estate->es_num_pending_async; ++tidx)
+			/* Step 1: skip over not-completed elements at the beginning */
+			while (npending < estate->es_num_pending_async &&
+				   estate->es_pending_async[npending]->state !=
+				   ASYNCREQ_COMPLETE)
+				npending++;
+
+			/* Step 2: move the remaining not-completed elements forward */
+			for (i = npending + 1; i < estate->es_num_pending_async; ++i)
 			{
-				PendingAsyncRequest *head;
-				PendingAsyncRequest *tail = estate->es_pending_async[tidx];
+				PendingAsyncRequest *tmp;
+				PendingAsyncRequest *curr = estate->es_pending_async[i];
 
-				Assert(tail->state != ASYNCREQ_CALLBACK_PENDING);
+				Assert(curr->state != ASYNCREQ_CALLBACK_PENDING);
 
-				if (tail->state == ASYNCREQ_COMPLETE)
+				if (curr->state == ASYNCREQ_COMPLETE)
 					continue;
-				head = estate->es_pending_async[hidx];
-				estate->es_pending_async[tidx] = head;
-				estate->es_pending_async[hidx] = tail;
-				++hidx;
+
+				tmp = estate->es_pending_async[npending];
+				estate->es_pending_async[npending] =
+					estate->es_pending_async[i];
+				estate->es_pending_async[i] = tmp;
+				++npending;
 			}
-			estate->es_num_pending_async = hidx;
+
+			estate->es_num_pending_async = npending;
 		}
 
 		/*
-- 
2.9.2

#51Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro HORIGUCHI (#50)

On Tue, Jul 25, 2017 at 5:11 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

[ new patches ]

I spent some time today refreshing my memory of what's going on with
this thread.

Ostensibly, the advantage of this framework over my previous proposal
is that it avoids inserting anything into ExecProcNode(), which is
probably a good thing to avoid given how frequently ExecProcNode() is
called. Unless the parent and the child both know about asynchronous
execution and choose to use it, everything runs exactly as it does
today and so there is no possibility of a complaint about a
performance hit. As far as it goes, that is good.

However, at a deeper level, I fear we haven't really solved the
problem. If an Append is directly on top of a ForeignScan node, then
this will work. But if an Append is indirectly on top of a
ForeignScan node with some other stuff in the middle, then it won't -
unless we make whichever nodes appear between the Append and the
ForeignScan async-capable. Indeed, we'd really want all kinds of
joins and aggregates to be async-capable so that examples like the one
Corey asked about in
/messages/by-id/CADkLM=fuvVdKvz92XpCRnb4=rj6bLOhSLifQ3RV=Sb4Q5rJsRA@mail.gmail.com
will work.

But if we do, then I fear we'll just be reintroducing the same
performance regression that we introduced by switching to this
framework from the previous one - or maybe a different one, but a
regression all the same. Every type of intermediate node will have to
have a code path where it uses ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and
it seems like that will either end up duplicating a lot of code from
the regular code path or, alternatively, polluting the regular code
path with some of the async code's concerns to avoid duplication, and
maybe slowing things down.
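
To make that concern concrete, here is a minimal sketch of the kind of
dual code path I mean; the node type and every field and helper name
below are hypothetical, not taken from the posted patch:

/*
 * Hypothetical sketch: an intermediate node grows a second way of
 * pulling from its child alongside the classic ExecProcNode() call.
 */
static TupleTableSlot *
FetchInnerTuple(HogeJoinState *node)
{
	if (node->inner_is_async_capable)
	{
		/*
		 * Async path: post a request and return; the tuple arrives
		 * later through a response callback run by the event loop.
		 */
		ExecAsyncRequest(node->ps.state, node->pending_inner_request);
		return NULL;
	}

	/* Sync path: the classic demand-pull call. */
	return ExecProcNode(innerPlanState(node));
}

Multiply that by every join and aggregate type and it's easy to see
where the duplication, or the pollution, would come from.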

Maybe that concern is unjustified; I'm not sure. Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#52Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#51)

Robert Haas <robertmhaas@gmail.com> writes:

Ostensibly, the advantage of this framework over my previous proposal
is that it avoids inserting anything into ExecProcNode(), which is
probably a good thing to avoid given how frequently ExecProcNode() is
called. Unless the parent and the child both know about asynchronous
execution and choose to use it, everything runs exactly as it does
today and so there is no possibility of a complaint about a
performance hit. As far as it goes, that is good.

However, at a deeper level, I fear we haven't really solved the
problem. If an Append is directly on top of a ForeignScan node, then
this will work. But if an Append is indirectly on top of a
ForeignScan node with some other stuff in the middle, then it won't -
unless we make whichever nodes appear between the Append and the
ForeignScan async-capable.

I have not been paying any attention to this thread whatsoever,
but I wonder if you can address your problem by building on top of
the ExecProcNode replacement that Andres is working on,
/messages/by-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de

The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed. I'm not sure that it will
scale well to several different kinds of insertions though, for instance
if you wanted both instrumentation and async support on the same node.
But maybe those two couldn't be arms-length from each other anyway,
in which case it might be fine as-is.
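
Roughly, the trick looks like this (a sketch modeled on that WIP patch;
treat the exact function and field names as illustrative):

static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
	/*
	 * On the first call, decide whether this node needs any wrapper
	 * ($extra_stuff such as instrumentation).  Nodes that don't are
	 * dispatched directly from then on and pay nothing extra.
	 */
	if (node->instrument)
		node->ExecProcNode = ExecProcNodeInstr;	/* wrapped path */
	else
		node->ExecProcNode = node->ExecProcNodeReal;	/* bare path */

	return node->ExecProcNode(node);
}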

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#53Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#52)

On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I have not been paying any attention to this thread whatsoever,
but I wonder if you can address your problem by building on top of
the ExecProcNode replacement that Andres is working on,
/messages/by-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de

The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed. I'm not sure that it will
scale well to several different kinds of insertions though, for instance
if you wanted both instrumentation and async support on the same node.
But maybe those two couldn't be arms-length from each other anyway,
in which case it might be fine as-is.

Yeah, I don't quite see how that would apply in this case -- what we
need here is not as simple as just conditionally injecting an extra
bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#54Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Robert Haas (#51)

Thank you for the comment.

At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com>

But if we do, then I fear we'll just be reintroducing the same
performance regression that we introduced by switching to this
framework from the previous one - or maybe a different one, but a
regression all the same. Every type of intermediate node will have to
have a code path where it uses ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and

I understand what Robert is concerned about and I think I share the
same opinion. It needs a further, different framework.

At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmoa=ke_zfucOAa3YEUnBSC=FSXn8SU2aYc8PGBBp=Yy9fw@mail.gmail.com>

On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I have not been paying any attention to this thread whatsoever,
but I wonder if you can address your problem by building on top of
the ExecProcNode replacement that Andres is working on,
/messages/by-id/20170726012641.bmhfcp5ajpudihl6@alap3.anarazel.de

The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed. I'm not sure that it will
scale well to several different kinds of insertions though, for instance
if you wanted both instrumentation and async support on the same node.
But maybe those two couldn't be arms-length from each other anyway,
in which case it might be fine as-is.

Yeah, I don't quite see how that would apply in this case -- what we
need here is not as simple as just conditionally injecting an extra
bit.

Thank you for the pointer, Tom. The subject (segfault in HEAD...)
hadn't made me think that this kind of discussion was being held
there. Anyway, it seems very close to asynchronous execution, so
I'll catch up on it while considering how I can tie this work in.

Regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#55Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Tom Lane (#52)

At Fri, 28 Jul 2017 17:31:05 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170728.173105.238045591.horiguchi.kyotaro@lab.ntt.co.jp>

Thank you for the comment.

At Wed, 26 Jul 2017 17:16:43 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmoYrbgTBnLwnr1v=pk+C=znWg7AgV9=M9ehrq6TDexPQNw@mail.gmail.com>

regression all the same. Every type of intermediate node will have to
have a code path where it uses ExecAsyncRequest() /
ExecAsyncHogeResponse() rather than ExecProcNode() to get tuples, and

I understand what Robert is concerned about and I share the same
opinion. It needs a further, different framework.

At Thu, 27 Jul 2017 06:39:51 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmoa=ke_zfucOAa3YEUnBSC=FSXn8SU2aYc8PGBBp=Yy9fw@mail.gmail.com>

On Wed, Jul 26, 2017 at 5:43 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The scheme he has allows $extra_stuff to be injected into ExecProcNode at
no cost when $extra_stuff is not needed, because you simply don't insert
the wrapper function when it's not needed. I'm not sure that it will

...

Yeah, I don't quite see how that would apply in this case -- what we
need here is not as simple as just conditionally injecting an extra
bit.

Thank you for the pointer, Tom. The subject (segfault in HEAD...)
hadn't made me think that this kind of discussion was being held
there. Anyway, it seems very close to asynchronous execution, so
I'll catch up on it while considering how I can tie this work in.

I now understand the executor change that has just been made on
master based on the thread you pointed to. It seems to have the
capability to let an exec node switch to being async-aware with no
extra cost on non-async processing. So it would be doable to (just)
*shrink* the current framework by detaching the async-aware side
of the API. But to get the most out of asynchrony, multiple
async-capable nodes distributed under async-unaware nodes must be
able to run simultaneously.

There seem to be two ways to achieve this.

One is propagating a required-async-nodes bitmap up to the topmost
node and waiting for all the required nodes to become ready. In
the long run this requires all nodes to be async-aware, and that
apparently would have a bad effect on the performance of
async-unaware nodes containing async-capable nodes.

Another is getting rid of the recursive calls used to run an
execution tree. It is perhaps the same as what was mentioned as
"data-centric processing" in previous threads [1][2], but I'd like
to pay attention to the aspect of "enabling the execution tree to
be resumed from an arbitrary leaf node". So I'm considering
realizing it still in a one-tuple-at-a-time manner instead of
collecting all tuples of a leaf node first, even though I'm not
sure it is doable. A rough sketch of this idea follows the
references below.

[1]: /messages/by-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com
[2]: /messages/by-id/20160629183254.frcm3dgg54ud5m6o@alap3.anarazel.de
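
To show the control-flow inversion I have in mind, a very loose
sketch is below; every function name in it is hypothetical, and
nothing here comes from a posted patch:

/* Hypothetical push-style driver loop resuming from ready leaves. */
static void
ExecDriveFromLeaves(EState *estate)
{
	for (;;)
	{
		/* block until some async-capable leaf has a tuple ready */
		PlanState  *leaf = WaitForReadyLeaf(estate);

		if (leaf == NULL)
			break;				/* every leaf has reached EOF */

		/*
		 * Push the tuple up one parent at a time; each parent either
		 * absorbs it (say, into a hash table) or produces a tuple to
		 * pass further upward, until the root emits one.
		 */
		PushTupleUpward(leaf, FetchLeafTuple(leaf));
	}
}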

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#56Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro HORIGUCHI (#55)

On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Another is getting rid of recursive call to run an execution
tree.

That happens to be exactly what Andres did for expression evaluation
in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
generalizing that to include the plan tree as well as expression trees
is likely to be the long-term way forward here. Unfortunately, that's
probably another gigantic patch (that should probably be written by
Andres).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#57Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Robert Haas (#56)

Thank you for the comment.

At Tue, 1 Aug 2017 16:27:41 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobbZrBPb7cvFj3ACPX2A_qSEB4ughRmB5dkGPXUYx_E+Q@mail.gmail.com>

On Mon, Jul 31, 2017 at 5:42 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Another is getting rid of recursive call to run an execution
tree.

That happens to be exactly what Andres did for expression evaluation
in commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755, and I think
generalizing that to include the plan tree as well as expression trees
is likely to be the long-term way forward here.

I read it in the source tree. The patch converts an expression
tree into an intermediate form and then runs it on a custom-made
interpreter. Guessing from the words "upside down" from Andres,
the whole thing will become source-driven.

Unfortunately, that's probably another gigantic patch (that
should probably be written by Andres).

Yeah, but building an async executor on the current style of executor
seems like futile work, and sitting idle until that patch arrives is
also a waste of time. So I'm planning to include the following stuff
in the next PoC patch, even though I'm not sure it can land on
Andres's coming patch.

- Tuple passing outside the call stack. (I remember this came up
earlier in the thread, but I couldn't find the message.) A toy sketch
of this follows the list below.

This should be included in Andres's patch.

- Giving the executor the ability to run from the data-source (or
driver) nodes up to the root.

I'm not sure whether this is included, but I suppose he is aiming at
this kind of thing.

- Rebuilding asynchronous execution on the upside-down executor.
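
A toy sketch of the first item, with invented names (plain C, not
executor code): the producer deposits tuples into a small ring buffer
owned by the executor, and the consumer drains it, so no tuple is
ever passed via a return value on the call stack.

#include <stdio.h>

#define QUEUE_LEN 8

typedef struct TupleQueue
{
	int			tuples[QUEUE_LEN];
	int			head;
	int			tail;
} TupleQueue;

static int
queue_put(TupleQueue *q, int tuple)
{
	if ((q->tail + 1) % QUEUE_LEN == q->head)
		return 0;				/* full: the producer must suspend */
	q->tuples[q->tail] = tuple;
	q->tail = (q->tail + 1) % QUEUE_LEN;
	return 1;
}

static int
queue_get(TupleQueue *q, int *tuple)
{
	if (q->head == q->tail)
		return 0;				/* empty: the consumer must wait */
	*tuple = q->tuples[q->head];
	q->head = (q->head + 1) % QUEUE_LEN;
	return 1;
}

int
main(void)
{
	TupleQueue	q = {{0}, 0, 0};
	int			t;

	queue_put(&q, 42);			/* a driver node produced a tuple */
	while (queue_get(&q, &t))	/* the root drains without recursion */
		printf("tuple %d\n", t);
	return 0;
}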

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#58Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#57)

At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170803.093057.261590619.horiguchi.kyotaro@lab.ntt.co.jp>

Unfortunately, that's probably another gigantic patch (that
should probably be written by Andres).

Yeah, but building an async executor on the current style of executor
seems like futile work, and sitting idle until that patch arrives is
also a waste of time. So I'm planning to include the following stuff
in the next PoC patch, even though I'm not sure it can land on
Andres's coming patch.

- Tuple passing outside the call stack. (I remember this came up
earlier in the thread, but I couldn't find the message.)

This should be included in Andres's patch.

- Giving the executor the ability to run from the data-source (or
driver) nodes up to the root.

I'm not sure whether this is included, but I suppose he is aiming at
this kind of thing.

- Rebuilding asynchronous execution on the upside-down executor.

Anyway, I modified ExecProcNode into a push-up form and it *seems* to
work to some extent. But triggers and cursors are almost entirely
broken, and several other regression tests fail. Some nodes, such as
windowagg, are terribly difficult to change to this push-up form
(using a state machine), and of course it is terribly inefficient.
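
For illustration, a toy sketch of what the push-up (state machine)
form of a node looks like, with invented names: the node records its
phase explicitly so it can be suspended after any tuple and resumed
later, instead of keeping that state on the C call stack.

#include <stdio.h>

typedef enum { ST_INIT, ST_SCANNING, ST_DONE } NodePhase;

typedef struct ScanNode
{
	NodePhase	phase;
	int			next_row;		/* position saved across calls */
	int			nrows;
} ScanNode;

/* Return 1 and set *tuple when a row is produced, 0 at end of scan. */
static int
scan_resume(ScanNode *node, int *tuple)
{
	switch (node->phase)
	{
		case ST_INIT:
			node->next_row = 0;
			node->phase = ST_SCANNING;
			/* fall through */
		case ST_SCANNING:
			if (node->next_row < node->nrows)
			{
				*tuple = node->next_row++;
				return 1;		/* suspend here; resume on next call */
			}
			node->phase = ST_DONE;
			/* fall through */
		case ST_DONE:
			break;
	}
	return 0;
}

int
main(void)
{
	ScanNode	node = {ST_INIT, 0, 3};
	int			t;

	while (scan_resume(&node, &t))
		printf("row %d\n", t);
	return 0;
}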

I'm afraid all of this may turn out to be in vain. But anyway, and
FWIW, I'll post the work here after some cleanup.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#59Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#58)
1 attachment(s)

At Thu, 31 Aug 2017 21:52:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170831.215236.135328985.horiguchi.kyotaro@lab.ntt.co.jp>

At Thu, 03 Aug 2017 09:30:57 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20170803.093057.261590619.horiguchi.kyotaro@lab.ntt.co.jp>

Unfortunately, that's probably another gigantic patch (that
should probably be written by Andres).

Yeah, but building an async executor on the current style of executor
seems like futile work, and sitting idle until that patch arrives is
also a waste of time. So I'm planning to include the following stuff
in the next PoC patch, even though I'm not sure it can land on
Andres's coming patch.

- Tuple passing outside the call stack. (I remember this came up
earlier in the thread, but I couldn't find the message.)

This should be included in Andres's patch.

- Giving the executor the ability to run from the data-source (or
driver) nodes up to the root.

I'm not sure whether this is included, but I suppose he is aiming at
this kind of thing.

- Rebuilding asynchronous execution on the upside-down executor.

Anyway, I modified ExecProcNode into a push-up form and it *seems* to
work to some extent. But triggers and cursors are almost entirely
broken, and several other regression tests fail. Some nodes, such as
windowagg, are terribly difficult to change to this push-up form
(using a state machine), and of course it is terribly inefficient.

I'm afraid all of this may turn out to be in vain. But anyway, and
FWIW, I'll post the work here after some cleanup.

So, here it is. Maybe this is really a bad way to go. The worst part
is that it's terribly hard to maintain: the behavior of the state
machine constructed in this patch is hard to predict, so it is easily
broken. During the cleanup I hit many crashes and infinite loops, and
they were rather hard to diagnose. It will also soon be broken by
subsequent commits.

Anyway, and again FWIW, here it is. I'll set this aside for a while
(at least for the duration of this CF) and reconsider asynchrony in
different forms.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

poc_pushexecutor_20170904_4faa1dc.tar.bz2application/octet-streamDownload
#60Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#59)
3 attachment(s)

Hello.

A fully-asynchronous executor requires every node to be stateful and
suspendable at the point where it requests the next tuples from the
nodes underneath it. I tried a purely push-based executor but failed.

After the miserable patch upthread, I finally managed to make
executor nodes suspendable using computed jumps and to get rid of the
executor's recursive calls. But it runs about 10x slower for the
simple SeqScan case (pgbench ran with 9% degradation), and that does
not seem recoverable by simple improvements. So I gave up on that.

Then I returned to single-level asynchrony, in other words the simple
case of async-aware nodes sitting just above async-capable nodes. The
motivation for the framework in the previous patch was that polluting
ExecProcNode with async stuff degraded the sync (or normal) code
paths; as Tom suggested, the node->ExecProcNode trick can isolate the
async code path.
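
A toy sketch of that trick, with invented types (not the real
PlanState): every caller goes through a per-node function pointer, so
the async-aware entry point is installed only for plans that actually
contain async-capable children, and the normal path pays nothing.

#include <stdio.h>

typedef struct ToyPlanState
{
	int			(*ExecProcNode) (struct ToyPlanState *ps);
	int			counter;
} ToyPlanState;

static int
exec_plain(ToyPlanState *ps)
{
	return ps->counter++;		/* the ordinary synchronous path */
}

static int
exec_async_aware(ToyPlanState *ps)
{
	/* would first drain any async results that have already arrived */
	printf("checking async results first\n");
	return exec_plain(ps);
}

int
main(void)
{
	ToyPlanState ps = {exec_plain, 0};

	/* installed only when the plan has async-capable children */
	ps.ExecProcNode = exec_async_aware;
	printf("tuple %d\n", ps.ExecProcNode(&ps));
	return 0;
}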

The attached PoC patch theoretically has no impact on the normal code
paths and only brings gains in async cases. (The additional members
in PlanState did cause a slight degradation, though, seemingly coming
from alignment.)

But I haven't obtained sufficiently stable results from performance
tests; different builds of the same source code give noticeably
different results...

Anyway, I'll show the best result from several runs here.

                               original(ms)  patched(ms)  gain(%)
A: simple table scan       :        9714.70      9656.73      0.6
B: local partitioning      :        4119.44      4131.10     -0.3
C: single remote table     :        9484.86      9141.89      3.7
D: sharding (single con)   :        7114.34      6751.21      5.1
E: sharding (multi con)    :        7166.56      1827.93     74.5

A and B are degradation checks, which are expected to show no
degradation. C shows the gain from postgres_fdw's command presending
alone, on a single remote table. D shows the gain of sharding over a
single connection; the number of partitions/shards is 4. E shows the
gain with a dedicated connection per shard.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From fc424c16e124934581a184fcadaed1e05f7672c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner member to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index 754154b..d459f32 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4eb6e83..e6fc3dd 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index b4b7d28..182f759 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  MyLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index bd19fad..d36481e 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -552,6 +555,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -699,6 +712,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -725,6 +739,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1267,3 +1282,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a43193c..997ee8d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 2420b65..70b0bb9 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif							/* RESOWNER_PRIVATE_H */
-- 
2.9.2
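
To make the ownership protocol in 0001 concrete, here is roughly the calling
pattern it enables (a minimal sketch, not part of the patch: it assumes
CreateWaitEventSet registers the set with the passed-in owner and that
FreeWaitEventSet unregisters it; sock and my_user_data stand in for whatever
the caller is watching, and PG_WAIT_EXTENSION is used only because 0002's
WAIT_EVENT_ASYNC_WAIT hasn't been introduced yet at this point):

static void
example_wait_once(pgsocket sock, void *my_user_data)
{
	WaitEvent	occurred[16];
	WaitEventSet *wes;

	wes = CreateWaitEventSet(TopTransactionContext,
							 TopTransactionResourceOwner, 16);
	AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, my_user_data);
	(void) WaitEventSetWait(wes, -1, occurred, lengthof(occurred),
							PG_WAIT_EXTENSION);

	/*
	 * On the normal path we free (and thereby unregister) the set
	 * ourselves; if WaitEventSetWait had thrown an error instead, the
	 * resource owner would free it at abort rather than leaking it.
	 */
	FreeWaitEventSet(wes);
}

The point of the PrintWESLeakWarning() path above is that reaching commit
with a set still registered indicates a missing FreeWaitEventSet call
somewhere.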

0002-core-side-modification.patchtext/x-patch; charset=us-asciiDownload
From 1b213d238c398dc77cb31cf2a92284c70d292e9e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 110 ++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 194 ++++++++++++++++++++++++++++++--
 src/backend/executor/nodeForeignscan.c  |  22 +++-
 src/backend/optimizer/plan/createplan.c |  56 ++++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/include/executor/execAsync.h        |  23 ++++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/execnodes.h           |  18 ++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 13 files changed, 428 insertions(+), 20 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 083b20f..21f5ad0 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..f7daed7
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void
+ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+	pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+					   void *data, bool reinit)
+{
+	switch (nodeTag(node))
+	{
+		case T_ForeignScanState:
+			return ExecForeignAsyncConfigureWait((ForeignScanState *) node,
+												 wes, data, reinit);
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				 (int) nodeTag(node));
+	}
+
+	return false;				/* unreachable; keep compiler quiet */
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+	static int *refind = NULL;
+	static int refindsize = 0;
+	WaitEventSet *wes;
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int noccurred = 0;
+	Bitmapset *fired_events = NULL;
+	int i;
+	int n;
+
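+	/*
+	 * refind[] is static scratch space that maps fired wait events back to
+	 * plan nodes: slot i holds the value i, and a pointer to that slot is
+	 * registered as the corresponding event's user_data below.
+	 */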
+	n = bms_num_members(waitnodes);
+	wes = CreateWaitEventSet(TopTransactionContext,
+							 TopTransactionResourceOwner, n);
+	if (refindsize < n)
+	{
+		if (refindsize == 0)
+			refindsize = EVENT_BUFFER_SIZE; /* XXX */
+		while (refindsize < n)
+			refindsize *= 2;
+		if (refind)
+			refind = (int *) repalloc(refind, refindsize * sizeof(int));
+		else
+			refind = (int *) palloc(refindsize * sizeof(int));
+	}
+
+	n = 0;
+	for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+		 i = bms_next_member(waitnodes, i))
+	{
+		refind[i] = i;
+		if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+			n++;
+	}
+
+	if (n == 0)
+	{
+		FreeWaitEventSet(wes);
+		return NULL;
+	}
+
+	noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+								 EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+	FreeWaitEventSet(wes);
+	if (noccurred == 0)
+		return NULL;
+
+	for (i = 0 ; i < noccurred ; i++)
+	{
+		WaitEvent *w = &occurred_event[i];
+
+		if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		{
+			int n = *(int*)w->user_data;
+
+			fired_events = bms_add_member(fired_events, n);
+		}
+	}
+
+	return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index bed9bb8..5355bb2 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,9 +59,11 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool exec_append_initialize_next(AppendState *appendstate);
 
 
@@ -81,16 +83,16 @@ exec_append_initialize_next(AppendState *appendstate)
 	/*
 	 * get information from the append node
 	 */
-	whichplan = appendstate->as_whichplan;
+	whichplan = appendstate->as_whichsyncplan;
 
-	if (whichplan < 0)
+	if (whichplan < appendstate->as_nasyncplans)
 	{
 		/*
 		 * if scanning in reverse, we start at the last scan in the list and
 		 * then proceed back to the first.. in any case we inform ExecAppend
 		 * that we are at the end of the line by returning FALSE
 		 */
-		appendstate->as_whichplan = 0;
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 		return FALSE;
 	}
 	else if (whichplan >= appendstate->as_nplans)
@@ -98,7 +100,7 @@ exec_append_initialize_next(AppendState *appendstate)
 		/*
 		 * as above, end the scan if we go beyond the last scan in our list..
 		 */
-		appendstate->as_whichplan = appendstate->as_nplans - 1;
+		appendstate->as_whichsyncplan = appendstate->as_nplans - 1;
 		return FALSE;
 	}
 	else
@@ -128,7 +130,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	ListCell   *lc;
 
 	/* check for unsupported flags */
-	Assert(!(eflags & EXEC_FLAG_MARK));
+	Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
 	/*
 	 * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -151,6 +153,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ExecProcNode = ExecAppend;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* Choose async version of Exec function */
+	if (appendstate->as_nasyncplans > 0)
+		appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+	/* initially, all async subplans need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -173,11 +188,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	foreach(lc, node->appendplans)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
+		int			sub_eflags = eflags;
+
+		if (i < appendstate->as_nasyncplans)
+			sub_eflags |= EXEC_FLAG_ASYNC;
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags);
 		i++;
 	}
 
 	/*
 	 * initialize output tuple type
 	 */
@@ -187,7 +210,10 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
-	/*
-	 * initialize to scan first subplan
-	 */
-	appendstate->as_whichplan = 0;
+	/*
+	 * initialize to scan first synchronous subplan
+	 */
+	appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
 	exec_append_initialize_next(appendstate);
 
 	return appendstate;
@@ -204,6 +230,8 @@ ExecAppend(PlanState *pstate)
 {
 	AppendState *node = castNode(AppendState, pstate);
 
+	Assert(node->as_nasyncplans == 0);
+
 	for (;;)
 	{
 		PlanState  *subnode;
@@ -214,7 +242,7 @@ ExecAppend(PlanState *pstate)
 		/*
 		 * figure out which subplan we are currently processing
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -237,9 +265,9 @@ ExecAppend(PlanState *pstate)
 		 * ExecInitAppend.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+			node->as_whichsyncplan++;
 		else
-			node->as_whichplan--;
+			node->as_whichsyncplan--;
 		if (!exec_append_initialize_next(node))
 			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
@@ -247,6 +275,141 @@ ExecAppend(PlanState *pstate)
 	}
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+	AppendState *node = castNode(AppendState, pstate);
+	Bitmapset *needrequest;
+	int	i;
+
+	Assert(node->as_nasyncplans > 0);
+
+	if (node->as_nasyncresult > 0)
+	{
+		--node->as_nasyncresult;
+		return node->as_asyncresult[node->as_nasyncresult];
+	}
+
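+	/*
+	 * Issue a new request to every async subplan that needs one.  A subplan
+	 * that returns a tuple right away is queued for another request; one
+	 * that cannot goes to the pending list and is waited on below.
+	 */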
+	needrequest = node->as_needrequest;
+	node->as_needrequest = NULL;
+	while ((i = bms_first_member(needrequest)) >= 0)
+	{
+		TupleTableSlot *slot;
+		PlanState *subnode = node->appendplans[i];
+
+		slot = ExecProcNode(subnode);
+		if (subnode->asyncstate == AS_AVAILABLE)
+		{
+			if (!TupIsNull(slot))
+			{
+				node->as_asyncresult[node->as_nasyncresult++] = slot;
+				node->as_needrequest = bms_add_member(node->as_needrequest, i);
+			}
+		}
+		else
+			node->as_pending_async = bms_add_member(node->as_pending_async, i);
+	}
+	bms_free(needrequest);
+
+	for (;;)
+	{
+		TupleTableSlot *result;
+
+		/* return now if a result is available */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		while (!bms_is_empty(node->as_pending_async))
+		{
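+			/*
+			 * Block indefinitely only when no sync subplans remain to run;
+			 * otherwise just poll (timeout 0) so that the sync subplans
+			 * below can make progress.
+			 */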
+			long timeout = node->as_syncdone ? -1 : 0;
+			Bitmapset *fired;
+			int i;
+
+			fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async,
+									   timeout);
+			while ((i = bms_first_member(fired)) >= 0)
+			{
+				TupleTableSlot *slot;
+				PlanState *subnode = node->appendplans[i];
+				slot = ExecProcNode(subnode);
+				if (subnode->asyncstate == AS_AVAILABLE)
+				{
+					if (!TupIsNull(slot))
+					{
+						node->as_asyncresult[node->as_nasyncresult++] = slot;
+						node->as_needrequest =
+							bms_add_member(node->as_needrequest, i);
+					}
+					node->as_pending_async =
+						bms_del_member(node->as_pending_async, i);
+				}
+			}
+			bms_free(fired);
+
+			/* return now if a result is available */
+			if (node->as_nasyncresult > 0)
+			{
+				--node->as_nasyncresult;
+				return node->as_asyncresult[node->as_nasyncresult];
+			}
+
+			if (!node->as_syncdone)
+				break;
+		}
+
+		/*
+		 * If there is no asynchronous activity still pending and the
+		 * synchronous activity is also complete, we're totally done scanning
+		 * this node.  Otherwise, we're done with the asynchronous stuff but
+		 * must continue scanning the synchronous children.
+		 */
+		if (node->as_syncdone)
+		{
+			Assert(bms_is_empty(node->as_pending_async));
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		}
+
+		/*
+		 * get a tuple from the subplan
+		 */
+		result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+		if (!TupIsNull(result))
+		{
+			/*
+			 * If the subplan gave us something then return it as-is. We do
+			 * NOT make use of the result slot that was set up in
+			 * ExecInitAppend; there's no need for it.
+			 */
+			return result;
+		}
+
+		/*
+		 * Go on to the "next" subplan in the appropriate direction. If no
+		 * more subplans, return the empty slot set up for us by
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
+		 */
+		if (ScanDirectionIsForward(node->ps.state->es_direction))
+			node->as_whichsyncplan++;
+		else
+			node->as_whichsyncplan--;
+		if (!exec_append_initialize_next(node))
+		{
+			node->as_syncdone = true;
+			if (bms_is_empty(node->as_pending_async))
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/* Else loop back and try to get a tuple from the new subplan */
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecEndAppend
  *
@@ -280,6 +443,15 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+	{
+		ExecShutdownNode(node->appendplans[i]);
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	}
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -298,6 +470,6 @@ ExecReScanAppend(AppendState *node)
 		if (subnode->chgParam == NULL)
 			ExecReScan(subnode);
 	}
-	node->as_whichplan = 0;
+	node->as_whichsyncplan = node->as_nasyncplans;
 	exec_append_initialize_next(node);
 }
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 20892d6..e851988 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
 					(ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
 	scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+	scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+	if ((eflags & EXEC_FLAG_ASYNC) != 0)
+		scanstate->fs_async = true;
 
 	/*
 	 * Miscellaneous initialization
@@ -388,3 +391,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecForeignAsyncConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+							  void *caller_data, bool reinit)
+{
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+												 caller_data, reinit);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 792ea84..53eb56d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -203,7 +203,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 						 Index scanrelid, char *enrname);
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
-static Append *make_append(List *appendplans, List *tlist, List *partitioned_rels);
+static Append *make_append(List *appendplans, int nasyncplans, int referent,
+						   List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -283,6 +284,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1004,8 +1006,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1040,7 +1046,18 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/* Classify as async-capable or not */
+		if (is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1050,7 +1067,9 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, tlist, best_path->partitioned_rels);
+	plan = make_append(list_concat(asyncplans, syncplans), nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5281,7 +5300,8 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int nasyncplans, int referent,
+			List *tlist, List *partitioned_rels)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5292,6 +5312,8 @@ make_append(List *appendplans, List *tlist, List *partitioned_rels)
 	plan->righttree = NULL;
 	node->partitioned_rels = partitioned_rels;
 	node->appendplans = appendplans;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6628,3 +6650,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 3a0b49c..4c6571e 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3628,6 +3628,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..5fd67d9
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+								   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+									 long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 37fd6b2..2ab9d72 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS		0x0020	/* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS	0x0040	/* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA	0x0080	/* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC			0x0100	/* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0354c2c..fed46d7 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								shm_toc *toc);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+										  WaitEventSet *wes,
+										  void *caller_data, bool reinit);
 
 #endif							/* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 04e43cc..566236b 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
 															List *fdw_private,
 															RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+													WaitEventSet *wes,
+													void *caller_data,
+													bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -182,6 +187,7 @@ typedef struct FdwRoutine
 	GetForeignPlan_function GetForeignPlan;
 	BeginForeignScan_function BeginForeignScan;
 	IterateForeignScan_function IterateForeignScan;
+	IterateForeignScan_function IterateForeignScanAsync;
 	ReScanForeignScan_function ReScanForeignScan;
 	EndForeignScan_function EndForeignScan;
 
@@ -232,6 +238,11 @@ typedef struct FdwRoutine
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 
 	/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index c461134..7f663eb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -840,6 +840,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+	AS_AVAILABLE,				/* a result is available immediately */
+	AS_WAITING					/* waiting for an external event */
+} AsyncState;
+
 typedef struct PlanState
 {
 	NodeTag		type;
@@ -880,6 +886,9 @@ typedef struct PlanState
 	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
 	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
+
+	AsyncState	asyncstate;
+	int32		padding;			/* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1003,7 +1012,13 @@ typedef struct AppendState
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
-	int			as_whichplan;
+	int			as_nasyncplans;	/* # of async-capable children */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	Bitmapset  *as_pending_async;	/* pending async plans */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
 } AppendState;
 
 /* ----------------
@@ -1546,6 +1561,7 @@ typedef struct ForeignScanState
 	Size		pscan_len;		/* size of parallel coordination information */
 	/* use struct pointer to avoid including fdwapi.h here */
 	struct FdwRoutine *fdwroutine;
+	bool		fs_async;
 	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index a382331..e0eccc8 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -248,6 +248,8 @@ typedef struct Append
 	/* RT indexes of non-leaf tables in a partition tree */
 	List	   *partitioned_rels;
 	List	   *appendplans;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..fe9d39c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2
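
For an FDW other than postgres_fdw, opting into the 0002 interface boils down
to filling in the two new FdwRoutine fields.  A minimal sketch follows (the
Example* names and the fdw_state layout are made up for illustration; only
the callback signatures come from the patch):

typedef struct ExampleFdwState
{
	pgsocket	sock;			/* socket of the remote connection */
} ExampleFdwState;

static bool
exampleIsForeignPathAsyncCapable(ForeignPath *path)
{
	return true;				/* every path of this FDW may run async */
}

static bool
exampleForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
								 void *caller_data, bool reinit)
{
	ExampleFdwState *st = (ExampleFdwState *) node->fdw_state;

	/*
	 * caller_data is handed back verbatim in the fired event's user_data;
	 * that is how ExecAsyncEventWait maps an event to its Append child.
	 */
	AddWaitEventToSet(wes, WL_SOCKET_READABLE, st->sock, NULL, caller_data);
	return true;				/* we registered an event to wait on */
}

and, in the FDW's handler, next to the other FdwRoutine assignments:

	routine->IsForeignPathAsyncCapable = exampleIsForeignPathAsyncCapable;
	routine->ForeignAsyncConfigureWait = exampleForeignAsyncConfigureWait;

0003 below does exactly this for postgres_fdw.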

0003-async-postgres_fdw.patchtext/x-patch; charset=us-asciiDownload
From 9f6a16ef7f7d1a38353216191641deb0d3ea58e7 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 484 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 522 insertions(+), 138 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index be4ec07..00301d0 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
 	bool		invalidated;	/* true if reconnect is pending */
 	uint32		server_hashvalue;	/* hash value of foreign server OID */
 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
+		entry->storage = NULL;
 	}
 
 	/*
@@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocate it,
+ * zero-filled, with the given initsize if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	bool		found;
+	ConnCacheEntry *entry;
+	ConnCacheKey key;
+
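+	/*
+	 * GetConnection() must already have created the hash entry for this
+	 * user mapping, so HASH_ENTER is expected to find it here.
+	 */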
+	key = user->umid;
+	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+	Assert(found);
+
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4339bbf..2a0a662 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6512,7 +6512,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6540,7 +6540,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6568,7 +6568,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6596,7 +6596,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6662,35 +6662,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6700,35 +6705,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6758,11 +6768,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6776,11 +6786,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6811,16 +6821,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6838,16 +6848,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -6998,27 +7008,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fb65e2e..0688504 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* Last node in the waiting list.
+									 * Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -348,6 +380,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+											  WaitEventSet *wes,
+											  void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -368,7 +404,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +512,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1322,12 +1366,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1383,32 +1436,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (node->fs_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					node->ss.ps.asyncstate = AS_WAITING;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting behind this node on the same connection,
+			 * let the first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Someone else is holding this connection and we want this node
+			 * to run later.  Add myself to the tail of the waiters' list and
+			 * return not-ready.  To avoid scanning through the waiters'
+			 * list, the current owner maintains a shortcut to the last
+			 * waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			node->ss.ps.asyncstate =
+				fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this point no node is running a query on this connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * Send a new fetch request on behalf of the connection's next
+		 * owner, if needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_conn_owner->fs_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
+			node->ss.ps.asyncstate =
+				fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
+	node->ss.ps.asyncstate = AS_AVAILABLE;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Clean up the async state and absorb any remaining result on the
+ *		connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1700,7 +1876,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1779,6 +1957,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1789,14 +1969,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1804,10 +1984,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1845,6 +2025,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1865,14 +2047,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1880,10 +2062,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1921,6 +2103,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1941,14 +2125,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1956,10 +2140,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2006,16 +2190,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2303,7 +2487,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2356,7 +2542,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2403,8 +2592,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2523,6 +2712,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2565,6 +2755,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2919,11 +3119,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2989,47 +3189,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows of the FETCH previously sent on the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while (fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3039,27 +3288,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that the calling node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while (PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3143,7 +3447,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3153,12 +3457,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3166,9 +3470,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3299,9 +3603,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3309,10 +3613,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4582,6 +4886,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when this node is the connection owner. Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+								  void *caller_data, bool reinit)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(wes,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, caller_data);
+		return true;
+	}
+
+	return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4946,7 +5286,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..41ac1d2 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index ddfec79..56aae91 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1535,25 +1535,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1653,8 +1653,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2

#61Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#60)
3 attachment(s)
Re: [HACKERS] asynchronous execution

Hello,

At Fri, 20 Oct 2017 17:37:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171020.173707.12913619.horiguchi.kyotaro@lab.ntt.co.jp>

The attached PoC patch theoretically has no impact on the normal
code paths and simply brings gains in the async cases.

The just-committed parallel append patch conflicted with this, so
attached is a version rebased onto the current HEAD. The results of
a brief performance test follow.

                           patched(ms)  unpatched(ms)  gain(%)
A: simple table scan     :     3562.32        3444.81     -3.4
B: local partitioning    :     1451.25        1604.38      9.5
C: single remote table   :     8818.92        9297.76      5.1
D: sharding (single con) :     5966.14        6646.73     10.2
E: sharding (multi con)  :     1802.25        6515.49     72.3

A and B are degradation checks, which are expected to show no
degradation. C shows the gain from postgres_fdw's command presending
(issuing the next FETCH before its result is needed) alone, on a
single remote table. D shows the gain from sharding over a single
connection; the number of partitions/shards is 4. E shows the gain
from using a dedicated connection per shard.
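
For reference, a minimal sketch of the kind of schema tests D and E
presumably exercise follows. All names here are hypothetical (this is
not the actual benchmark script), and only shard 1 is spelled out:

CREATE EXTENSION postgres_fdw;

-- Test E: one foreign server, and therefore one connection, per shard.
-- For test D, all four children would instead point at the same
-- server and so share a single connection.
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'localhost', dbname 'shard1');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1;

CREATE TABLE pt (id int, v int);
CREATE FOREIGN TABLE pt_1 (id int CHECK (id % 4 = 0), v int)
    INHERITS (pt) SERVER shard1 OPTIONS (table_name 'pt_1');
-- ...repeat the server/mapping/foreign table for shards 2 through 4,
-- with CHECK (id % 4 = 1) and so on...

-- A scan of the parent becomes an Append over the foreign children;
-- with the patch, the FETCH for each child is sent before its result
-- is waited for.
SELECT sum(v) FROM pt;

With a single shared connection (D) only one query can be in flight at
a time, so the win is limited to presending; with one connection per
shard (E) the remote servers genuinely work in parallel, which matches
the large gain in the last row of the table above.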

Test A is accelerated by parallel sequential scan, and test B by the
newly introduced parallel append. Comparing A and B, I doubt that any
degradation is stably measurable, at least in my environment, but I
believe there is no degradation in theory. Tests C to E still show a
clear gain.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From b1aff3362b983975003d8a60f9b3593cb2fa62fc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change allows the creator of a WaitEventSet to specify a
resource owner, which then owns the set and releases it if necessary.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index fc15181..7c4077a 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 4eb6e83..e6fc3dd 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index b4b7d28..182f759 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -66,7 +66,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	/* Create a reusable WaitEventSet. */
 	if (cv_wait_event_set == NULL)
 	{
-		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, 1);
+		cv_wait_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 1);
 		AddWaitEventToSet(cv_wait_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  MyLatch, NULL);
 	}
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 4c35ccf..e00e39c 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -538,6 +541,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -685,6 +698,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -711,6 +725,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1253,3 +1268,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a43193c..997ee8d 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 2420b65..70b0bb9 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif							/* RESOWNER_PRIVATE_H */
-- 
2.9.2

0002-core-side-modification.patchtext/x-patch; charset=us-asciiDownload
From 9c1273a4868bed5eb0991f842296cb89c10470bc Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 110 ++++++++++++++
 src/backend/executor/nodeAppend.c       | 247 +++++++++++++++++++++++++++-----
 src/backend/executor/nodeForeignscan.c  |  22 ++-
 src/backend/optimizer/plan/createplan.c |  62 +++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/include/executor/execAsync.h        |  23 +++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/execnodes.h           |  18 ++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 13 files changed, 462 insertions(+), 45 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895..8ad2adf 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..f7daed7
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+	pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+					   void *data, bool reinit)
+{
+	switch (nodeTag(node))
+	{
+		case T_ForeignScanState:
+			return ExecForeignAsyncConfigureWait((ForeignScanState *) node,
+												 wes, data, reinit);
+			break;
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				 (int) nodeTag(node));
+	}
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+	static int *refind = NULL;
+	static int refindsize = 0;
+	WaitEventSet *wes;
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int noccurred = 0;
+	Bitmapset *fired_events = NULL;
+	int i;
+	int n;
+
+	n = bms_num_members(waitnodes);
+	wes = CreateWaitEventSet(TopTransactionContext,
+							 TopTransactionResourceOwner, n);
+	if (refindsize < n)
+	{
+		if (refindsize == 0)
+			refindsize = EVENT_BUFFER_SIZE; /* XXX */
+		while (refindsize < n)
+			refindsize *= 2;
+		if (refind)
+			refind = (int *) repalloc(refind, refindsize * sizeof(int));
+		else
+			refind = (int *) palloc(refindsize * sizeof(int));
+	}
+
+	n = 0;
+	for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+		 i = bms_next_member(waitnodes, i))
+	{
+		refind[i] = i;
+		if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+			n++;
+	}
+
+	if (n == 0)
+	{
+		FreeWaitEventSet(wes);
+		return NULL;
+	}
+
+	noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+								 EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+	FreeWaitEventSet(wes);
+	if (noccurred == 0)
+		return NULL;
+
+	for (i = 0 ; i < noccurred ; i++)
+	{
+		WaitEvent *w = &occurred_event[i];
+
+		if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		{
+			int n = *(int*)w->user_data;
+
+			fired_events = bms_add_member(fired_events, n);
+		}
+	}
+
+	return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 0e93713..f21ab36 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,6 +59,7 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -79,6 +80,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX		-1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,7 +106,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	ListCell   *lc;
 
 	/* check for unsupported flags */
-	Assert(!(eflags & EXEC_FLAG_MARK));
+	Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
 	/*
 	 * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -127,6 +129,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ExecProcNode = ExecAppend;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* Choose async version of Exec function */
+	if (appendstate->as_nasyncplans > 0)
+		appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+	/* initially, all async subplans need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -149,27 +164,48 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	foreach(lc, node->appendplans)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
+		int			sub_eflags = eflags;
+
+		if (i < appendstate->as_nasyncplans)
+			sub_eflags |= EXEC_FLAG_ASYNC;
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags);
 		i++;
 	}
 
+	/* if there's any async-capable subnode, use async-aware routine */
+	if (appendstate->as_nasyncplans)
+		appendstate->ps.ExecProcNode = ExecAppendAsync;
+
 	/*
 	 * initialize output tuple type
 	 */
 	ExecAssignResultTypeFromTL(&appendstate->ps);
 	appendstate->ps.ps_ProjInfo = NULL;
 
-	/*
-	 * Parallel-aware append plans must choose the first subplan to execute by
-	 * looking at shared memory, but non-parallel-aware append plans can
-	 * always start with the first subplan.
-	 */
-	appendstate->as_whichplan =
-		appendstate->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
+	if (appendstate->ps.plan->parallel_aware)
+	{
+		/*
+		 * Parallel-aware append plans must choose the first subplan to
+		 * execute by looking at shared memory, but non-parallel-aware append
+		 * plans can always start with the first subplan.
+		 */
 
-	/* If parallel-aware, this will be overridden later. */
-	appendstate->choose_next_subplan = choose_next_subplan_locally;
+		appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
+
+		/* If parallel-aware, this will be overridden later. */
+		appendstate->choose_next_subplan = choose_next_subplan_locally;
+	}
+	else
+	{
+		appendstate->as_whichsyncplan = 0;
+
+		/*
+		 * initialize to scan first synchronous subplan
+		 */
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
+		appendstate->choose_next_subplan = choose_next_subplan_locally;
+	}
 
 	return appendstate;
 }
@@ -186,10 +222,12 @@ ExecAppend(PlanState *pstate)
 	AppendState *node = castNode(AppendState, pstate);
 
 	/* If no subplan has been chosen, we must choose one before proceeding. */
-	if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+	if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
 		!node->choose_next_subplan(node))
 		return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
+	Assert(node->as_nasyncplans == 0);
+
 	for (;;)
 	{
 		PlanState  *subnode;
@@ -200,8 +238,9 @@ ExecAppend(PlanState *pstate)
 		/*
 		 * figure out which subplan we are currently processing
 		 */
-		Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(node->as_whichsyncplan >= 0 &&
+			   node->as_whichsyncplan < node->as_nplans);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -224,6 +263,137 @@ ExecAppend(PlanState *pstate)
 	}
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+	AppendState *node = castNode(AppendState, pstate);
+	Bitmapset *needrequest;
+	int	i;
+
+	Assert(node->as_nasyncplans > 0);
+
+	if (node->as_nasyncresult > 0)
+	{
+		--node->as_nasyncresult;
+		return node->as_asyncresult[node->as_nasyncresult];
+	}
+
+	needrequest = node->as_needrequest;
+	node->as_needrequest = NULL;
+	while ((i = bms_first_member(needrequest)) >= 0)
+	{
+		TupleTableSlot *slot;
+		PlanState *subnode = node->appendplans[i];
+
+		slot = ExecProcNode(subnode);
+		if (subnode->asyncstate == AS_AVAILABLE)
+		{
+			if (!TupIsNull(slot))
+			{
+				node->as_asyncresult[node->as_nasyncresult++] = slot;
+				node->as_needrequest = bms_add_member(node->as_needrequest, i);
+			}
+		}
+		else
+			node->as_pending_async = bms_add_member(node->as_pending_async, i);
+	}
+	bms_free(needrequest);
+
+	for (;;)
+	{
+		TupleTableSlot *result;
+
+		/* return now if a result is available */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		while (!bms_is_empty(node->as_pending_async))
+		{
+			long timeout = node->as_syncdone ? -1 : 0;
+			Bitmapset *fired;
+			int i;
+
+			fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async,
+									   timeout);
+			while ((i = bms_first_member(fired)) >= 0)
+			{
+				TupleTableSlot *slot;
+				PlanState *subnode = node->appendplans[i];
+				slot = ExecProcNode(subnode);
+				if (subnode->asyncstate == AS_AVAILABLE)
+				{
+					if (!TupIsNull(slot))
+					{
+						node->as_asyncresult[node->as_nasyncresult++] = slot;
+						node->as_needrequest =
+							bms_add_member(node->as_needrequest, i);
+					}
+					node->as_pending_async =
+						bms_del_member(node->as_pending_async, i);
+				}
+			}
+			bms_free(fired);
+
+			/* return now if a result is available */
+			if (node->as_nasyncresult > 0)
+			{
+				--node->as_nasyncresult;
+				return node->as_asyncresult[node->as_nasyncresult];
+			}
+
+			if (!node->as_syncdone)
+				break;
+		}
+
+		/*
+		 * If there is no asynchronous activity still pending and the
+		 * synchronous activity is also complete, we're totally done scanning
+		 * this node.  Otherwise, we're done with the asynchronous stuff but
+		 * must continue scanning the synchronous children.
+		 */
+		if (node->as_syncdone)
+		{
+			Assert(bms_is_empty(node->as_pending_async));
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		}
+
+		/*
+		 * get a tuple from the subplan
+		 */
+		result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+		if (!TupIsNull(result))
+		{
+			/*
+			 * If the subplan gave us something then return it as-is. We do
+			 * NOT make use of the result slot that was set up in
+			 * ExecInitAppend; there's no need for it.
+			 */
+			return result;
+		}
+
+		/*
+		 * Go on to the "next" subplan in the appropriate direction. If no
+		 * more subplans, return the empty slot set up for us by
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
+		 */
+		if (!node->choose_next_subplan(node))
+		{
+			node->as_syncdone = true;
+			if (bms_is_empty(node->as_pending_async))
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/* Else loop back and try to get a tuple from the new subplan */
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecEndAppend
  *
@@ -257,6 +427,15 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+	{
+		ExecShutdownNode(node->appendplans[i]);
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	}
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -276,7 +455,7 @@ ExecReScanAppend(AppendState *node)
 			ExecReScan(subnode);
 	}
 
-	node->as_whichplan =
+	node->as_whichsyncplan =
 		node->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
 }
 
@@ -365,7 +544,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-	int			whichplan = node->as_whichplan;
+	int			whichplan = node->as_whichsyncplan;
 
 	/* We should never see INVALID_SUBPLAN_INDEX in this case. */
 	Assert(whichplan >= 0 && whichplan <= node->as_nplans);
@@ -374,13 +553,13 @@ choose_next_subplan_locally(AppendState *node)
 	{
 		if (whichplan >= node->as_nplans - 1)
 			return false;
-		node->as_whichplan++;
+		node->as_whichsyncplan++;
 	}
 	else
 	{
 		if (whichplan <= 0)
 			return false;
-		node->as_whichplan--;
+		node->as_whichsyncplan--;
 	}
 
 	return true;
@@ -405,33 +584,33 @@ choose_next_subplan_for_leader(AppendState *node)
 
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
 	{
 		/* Mark just-completed subplan as finished. */
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 	}
 	else
 	{
 		/* Start with last subplan. */
-		node->as_whichplan = node->as_nplans - 1;
+		node->as_whichsyncplan = node->as_nplans - 1;
 	}
 
 	/* Loop until we find a subplan to execute. */
-	while (pstate->pa_finished[node->as_whichplan])
+	while (pstate->pa_finished[node->as_whichsyncplan])
 	{
-		if (node->as_whichplan == 0)
+		if (node->as_whichsyncplan == 0)
 		{
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-			node->as_whichplan = INVALID_SUBPLAN_INDEX;
+			node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 			LWLockRelease(&pstate->pa_lock);
 			return false;
 		}
-		node->as_whichplan--;
+		node->as_whichsyncplan--;
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < append->first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < append->first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
@@ -464,8 +643,8 @@ choose_next_subplan_for_worker(AppendState *node)
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
 	/* Mark just-completed subplan as finished. */
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	/* If all the plans are already done, we have nothing to do */
 	if (pstate->pa_next_plan == INVALID_SUBPLAN_INDEX)
@@ -490,10 +669,10 @@ choose_next_subplan_for_worker(AppendState *node)
 		else
 		{
 			/* At last plan, no partial plans, arrange to bail out. */
-			pstate->pa_next_plan = node->as_whichplan;
+			pstate->pa_next_plan = node->as_whichsyncplan;
 		}
 
-		if (pstate->pa_next_plan == node->as_whichplan)
+		if (pstate->pa_next_plan == node->as_whichsyncplan)
 		{
 			/* We've tried everything! */
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -503,7 +682,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* Pick the plan we found, and advance pa_next_plan one more time. */
-	node->as_whichplan = pstate->pa_next_plan++;
+	node->as_whichsyncplan = pstate->pa_next_plan++;
 	if (pstate->pa_next_plan >= node->as_nplans)
 	{
 		if (append->first_partial_plan < node->as_nplans)
@@ -519,8 +698,8 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < append->first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < append->first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index dc6cfcf..afc8a58 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
 					(ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
 	scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+	scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+	if ((eflags & EXEC_FLAG_ASYNC) != 0)
+		scanstate->fs_async = true;
 
 	/*
 	 * Miscellaneous initialization
@@ -389,3 +392,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+							  void *caller_data, bool reinit)
+{
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+												 caller_data, reinit);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index f6c83d0..402db1e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -204,7 +204,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels);
+						   int nasyncplans,	int referent,
+						   List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -284,6 +285,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1014,8 +1016,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1050,7 +1056,21 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/*
+		 * Classify as async-capable or not. If we have decided to run the
+		 * children in parallel, none of them can run asynchronously.
+		 */
+		if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1060,8 +1080,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, best_path->partitioned_rels);
+	plan = make_append(list_concat(asyncplans, syncplans),
+					   best_path->first_partial_path, nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5296,8 +5318,8 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+			int referent, List *tlist, List *partitioned_rels)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5309,6 +5331,8 @@ make_append(List *appendplans, int first_partial_plan,
 	node->partitioned_rels = partitioned_rels;
 	node->appendplans = appendplans;
 	node->first_partial_plan = first_partial_plan;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6646,3 +6670,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 5c256ff..09ea33b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3628,6 +3628,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..5fd67d9
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+								   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+									 long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index b5578f5..bd622c9 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS		0x0020	/* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS	0x0040	/* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA	0x0080	/* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC			0x0100	/* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 152abf0..1d95e39 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+										  WaitEventSet *wes,
+										  void *caller_data, bool reinit);
 
 #endif							/* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index 04e43cc..566236b 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
 															List *fdw_private,
 															RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+													WaitEventSet *wes,
+													void *caller_data,
+													bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -182,6 +187,7 @@ typedef struct FdwRoutine
 	GetForeignPlan_function GetForeignPlan;
 	BeginForeignScan_function BeginForeignScan;
 	IterateForeignScan_function IterateForeignScan;
+	IterateForeignScan_function IterateForeignScanAsync;
 	ReScanForeignScan_function ReScanForeignScan;
 	EndForeignScan_function EndForeignScan;
 
@@ -232,6 +238,11 @@ typedef struct FdwRoutine
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 
 	/* Support functions for path reparameterization. */
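
In passing: the PoC's postgres_fdw implementation of
IsForeignPathAsyncCapable (in patch 0003 below) simply returns true. As a
strawman, an FDW could instead gate this per path. The sketch below assumes
the path's parent rel keeps a PgFdwRelationInfo in fdw_private, as
postgres_fdw's planner code does, and reuses the allow_prefetch flag that
patch 0003 adds; it is an illustration, not part of the patch.

#include "postgres.h"
#include "foreign/fdwapi.h"
#include "postgres_fdw.h"

/* Sketch only: report async capability per path instead of unconditionally. */
static bool
example_IsForeignPathAsyncCapable(ForeignPath *path)
{
	PgFdwRelationInfo *fpinfo =
		(PgFdwRelationInfo *) path->path.parent->fdw_private;

	/* e.g. allow async execution only where overlapped fetching is allowed */
	return fpinfo != NULL && fpinfo->allow_prefetch;
}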
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 1a35c5c..c049251 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -843,6 +843,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+	AS_AVAILABLE,
+	AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
 	NodeTag		type;
@@ -883,6 +889,9 @@ typedef struct PlanState
 	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
 	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
+
+	AsyncState	asyncstate;
+	int32		padding;			/* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1012,10 +1021,16 @@ struct AppendState
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
-	int			as_whichplan;
+	int			as_nasyncplans;	/* # of async-capable children */
 	ParallelAppendState *as_pstate; /* parallel coordination info */
+	int			as_whichsyncplan; /* which sync plan is being executed */
 	Size		pstate_len;		/* size of parallel coordination info */
 	bool		(*choose_next_subplan) (AppendState *);
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	Bitmapset  *as_pending_async;	/* pending async plans */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1566,6 +1581,7 @@ typedef struct ForeignScanState
 	Size		pscan_len;		/* size of parallel coordination information */
 	/* use struct pointer to avoid including fdwapi.h here */
 	struct FdwRoutine *fdwroutine;
+	bool		fs_async;
 	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 02fb366..a6df261 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -249,6 +249,8 @@ typedef struct Append
 	List	   *partitioned_rels;
 	List	   *appendplans;
 	int			first_partial_plan;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 089b7c3..fe9d39c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -816,7 +816,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

0003-async-postgres_fdw.patchtext/x-patch; charset=us-asciiDownload
From d0882fbc09fce447e642278292b70e4a6b73575e Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 484 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 522 insertions(+), 138 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 4fbf043..646085f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
 	bool		invalidated;	/* true if reconnect is pending */
 	uint32		server_hashvalue;	/* hash value of foreign server OID */
 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
+		entry->storage = NULL;
 	}
 
 	/*
@@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user.  Allocate it
+ * with initsize bytes if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	bool		found;
+	ConnCacheEntry *entry;
+	ConnCacheKey key;
+
+	key = user->umid;
+	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+	Assert(found);
+
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641..3b4eefa 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6514,7 +6514,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6542,7 +6542,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6570,7 +6570,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6598,7 +6598,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6664,35 +6664,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6702,35 +6707,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6760,11 +6770,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6778,11 +6788,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6813,16 +6823,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6840,16 +6850,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -7000,27 +7010,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index fb65e2e..0688504 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+								 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -348,6 +380,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+											  WaitEventSet *wes,
+											  void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -368,7 +404,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +512,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1322,12 +1366,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1383,32 +1436,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (node->fs_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					node->ss.ps.asyncstate = AS_WAITING;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting for this node on the same connection, let
+			 * the first waiter be the next owner of the connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Someone else is holding this connection and we want this node
+			 * to run later. Add myself to the tail of the waiters' list, then
+			 * return not-ready.  To avoid scanning through the waiters' list,
+			 * the current owner maintains a shortcut to the last
+			 * waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			node->ss.ps.asyncstate =
+				fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this time no node is running on the connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * Send the next fetch request for the next owner of this connection,
+		 * if needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
+			if (!next_conn_owner->fs_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
+			node->ss.ps.asyncstate =
+				fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
+	node->ss.ps.asyncstate = AS_AVAILABLE;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1700,7 +1876,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1779,6 +1957,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1789,14 +1969,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1804,10 +1984,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1845,6 +2025,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1865,14 +2047,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1880,10 +2062,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1921,6 +2103,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1941,14 +2125,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1956,10 +2140,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2006,16 +2190,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2303,7 +2487,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2356,7 +2542,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2403,8 +2592,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2523,6 +2712,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2565,6 +2755,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2919,11 +3119,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2989,47 +3189,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows of a FETCH already sent on the node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* Move the remaining tuples to the beginning of the array */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3039,27 +3288,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3143,7 +3447,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3153,12 +3457,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3166,9 +3470,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3299,9 +3603,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3309,10 +3613,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4582,6 +4886,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when this node is the connection owner; otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+								  void *caller_data, bool reinit)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(wes,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, caller_data);
+		return true;
+	}
+
+	return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4946,7 +5286,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
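
To restate the connection-sharing protocol the hunks above implement: one
scan node owns the in-flight FETCH on a connection, and the others queue
behind it. A compressed sketch of just the enqueue step, written with the
patch's own fields as if it lived inside postgres_fdw.c (the patch inlines
this logic in postgresIterateForeignScan rather than using a helper):

/*
 * Sketch only: append "node" to the waiters of the connection's current
 * owner.  Each waiter links to the next via "waiter"; only the owner keeps
 * "last_waiter" pointing at the tail, so enqueueing is O(1).
 */
static void
enqueue_waiter(PgFdwConnpriv *connpriv, ForeignScanState *node)
{
	PgFdwScanState *owner_state = GetPgFdwScanState(connpriv->current_owner);
	PgFdwScanState *tail_state = GetPgFdwScanState(owner_state->last_waiter);

	tail_state->waiter = node;			/* link after the current tail */
	owner_state->last_waiter = node;	/* owner tracks the new tail */
}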
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 788b003..41ac1d2 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c7..cb9caa5 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1535,25 +1535,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1653,8 +1653,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.9.2

#62Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#61)
3 attachment(s)
Re: [HACKERS] asynchronous execution

At Mon, 11 Dec 2017 20:07:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171211.200753.191768178.horiguchi.kyotaro@lab.ntt.co.jp>

The attached PoC patch theoretically has no impact on the normal
code paths and simply brings gains in the async cases.

The just-committed parallel append patch conflicted with this, so the
attached is the version rebased onto current HEAD. The results of a
brief performance test follow.

                            patched(ms)  unpatched(ms)  gain(%)
A: simple table scan     :      3562.32        3444.81     -3.4
B: local partitioning    :      1451.25        1604.38      9.5
C: single remote table   :      8818.92        9297.76      5.1
D: sharding (single con) :      5966.14        6646.73     10.2
E: sharding (multi con)  :      1802.25        6515.49     72.3

A and B are degradation checks, expected to show no degradation. C
shows the gain from postgres_fdw's command pre-sending alone, on a
single remote table. D shows the gain from sharding over a single
connection. The number of partitions/shards is 4. E shows the gain
from using a dedicated connection per shard.

Test A is accelerated by parallel sequential scan, and introducing
parallel append accelerates test B. Comparing A and B, I doubt that
the degradation is stably measurable, at least in my environment, but
I believe there is no degradation in theory. Tests C to E still show
a clear gain.

regards,

The patch conflicts with 3cac0ec. This is the rebased version.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From be22b33b90abec93a2a609a1db4955e6910b2da0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via its resource owner in certain
cases. This change adds a resowner field to WaitEventSet and allows the
creator of a WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4f6d4d..890972b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index e6706f7..5457899 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index ef1d5ba..30edc8e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	{
 		WaitEventSet *new_event_set;
 
-		new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+		new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
 		AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  MyLatch, NULL);
 		AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e09a4f1..7ae8777 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -538,6 +541,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -685,6 +698,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -711,6 +725,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1253,3 +1268,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+	 * use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+	 * use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a4bcb48..838845a 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 22b377c..56f2059 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif							/* RESOWNER_PRIVATE_H */
-- 
2.9.2
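
For illustration, here is a minimal sketch of how a caller would use the
patched API. The enclosing context, the socket variable "sock", and the
choice of CurrentResourceOwner are assumptions made for this example; the
CreateWaitEventSet/AddWaitEventToSet/FreeWaitEventSet signatures are the
ones from the patch above.

/* Sketch: an error-safe wait event set owned by the current resowner. */
WaitEventSet *wes;

/*
 * Passing CurrentResourceOwner means that if an elog(ERROR) escapes
 * before FreeWaitEventSet() is reached, resource owner cleanup frees
 * the set (and emits a leak warning when released on commit).
 */
wes = CreateWaitEventSet(CurrentMemoryContext, CurrentResourceOwner, 2);
AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);

/* ... WaitEventSetWait(wes, timeout, occurred, nevents, wait_info) ... */

/* Normal exit path: freeing also unregisters the set from its owner. */
FreeWaitEventSet(wes);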

0002-core-side-modification.patch (text/x-patch; charset=us-ascii)
From 885f62d89a93edbda44330c3ecc3a7ac08e302ea Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 110 ++++++++++++++
 src/backend/executor/nodeAppend.c       | 247 +++++++++++++++++++++++++++-----
 src/backend/executor/nodeForeignscan.c  |  22 ++-
 src/backend/optimizer/plan/createplan.c |  62 +++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/include/executor/execAsync.h        |  23 +++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/execnodes.h           |  18 ++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 13 files changed, 462 insertions(+), 45 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895..8ad2adf 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..f7daed7
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,110 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+	pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+					   void *data, bool reinit)
+{
+	switch (nodeTag(node))
+	{
+	case T_ForeignScanState:
+		return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+											 wes, data, reinit);
+		break;
+	default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(node));
+	}
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+	static int *refind = NULL;
+	static int refindsize = 0;
+	WaitEventSet *wes;
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int noccurred = 0;
+	Bitmapset *fired_events = NULL;
+	int i;
+	int n;
+
+	n = bms_num_members(waitnodes);
+	wes = CreateWaitEventSet(TopTransactionContext,
+							 TopTransactionResourceOwner, n);
+	if (refindsize < n)
+	{
+		if (refindsize == 0)
+			refindsize = EVENT_BUFFER_SIZE; /* XXX */
+		while (refindsize < n)
+			refindsize *= 2;
+		if (refind)
+			refind = (int *) repalloc(refind, refindsize * sizeof(int));
+		else
+			refind = (int *) palloc(refindsize * sizeof(int));
+	}
+
+	n = 0;
+	for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+		 i = bms_next_member(waitnodes, i))
+	{
+		refind[i] = i;
+		if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+			n++;
+	}
+
+	if (n == 0)
+	{
+		FreeWaitEventSet(wes);
+		return NULL;
+	}
+
+	noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+								 EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+	FreeWaitEventSet(wes);
+	if (noccurred == 0)
+		return NULL;
+
+	for (i = 0 ; i < noccurred ; i++)
+	{
+		WaitEvent *w = &occurred_event[i];
+
+		if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		{
+			int n = *(int*)w->user_data;
+
+			fired_events = bms_add_member(fired_events, n);
+		}
+	}
+
+	return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 64a17fb..644af5b 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,6 +59,7 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -79,6 +80,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX		-1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,7 +106,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	ListCell   *lc;
 
 	/* check for unsupported flags */
-	Assert(!(eflags & EXEC_FLAG_MARK));
+	Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
 	/*
 	 * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -127,6 +129,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ExecProcNode = ExecAppend;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* Choose async version of Exec function */
+	if (appendstate->as_nasyncplans > 0)
+		appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+	/* initially, every async subplan needs a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Miscellaneous initialization
@@ -149,27 +164,48 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	foreach(lc, node->appendplans)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
+		int			sub_eflags = eflags;
+
+		if (i < appendstate->as_nasyncplans)
+			sub_eflags |= EXEC_FLAG_ASYNC;
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags);
 		i++;
 	}
 
+	/* if there's any async-capable subnode, use async-aware routine */
+	if (appendstate->as_nasyncplans)
+		appendstate->ps.ExecProcNode = ExecAppendAsync;
+
 	/*
 	 * initialize output tuple type
 	 */
 	ExecAssignResultTypeFromTL(&appendstate->ps);
 	appendstate->ps.ps_ProjInfo = NULL;
 
-	/*
-	 * Parallel-aware append plans must choose the first subplan to execute by
-	 * looking at shared memory, but non-parallel-aware append plans can
-	 * always start with the first subplan.
-	 */
-	appendstate->as_whichplan =
-		appendstate->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
+	if (appendstate->ps.plan->parallel_aware)
+	{
+		/*
+		 * Parallel-aware append plans must choose the first subplan to
+		 * execute by looking at shared memory, but non-parallel-aware append
+		 * plans can always start with the first subplan.
+		 */
 
-	/* If parallel-aware, this will be overridden later. */
-	appendstate->choose_next_subplan = choose_next_subplan_locally;
+		appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
+
+		/* If parallel-aware, this will be overridden later. */
+		appendstate->choose_next_subplan = choose_next_subplan_locally;
+	}
+	else
+	{
+		appendstate->as_whichsyncplan = 0;
+
+		/*
+		 * initialize to scan first synchronous subplan
+		 */
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
+		appendstate->choose_next_subplan = choose_next_subplan_locally;
+	}
 
 	return appendstate;
 }
@@ -186,10 +222,12 @@ ExecAppend(PlanState *pstate)
 	AppendState *node = castNode(AppendState, pstate);
 
 	/* If no subplan has been chosen, we must choose one before proceeding. */
-	if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+	if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
 		!node->choose_next_subplan(node))
 		return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
+	Assert(node->as_nasyncplans == 0);
+
 	for (;;)
 	{
 		PlanState  *subnode;
@@ -200,8 +238,9 @@ ExecAppend(PlanState *pstate)
 		/*
 		 * figure out which subplan we are currently processing
 		 */
-		Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(node->as_whichsyncplan >= 0 &&
+			   node->as_whichsyncplan < node->as_nplans);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -224,6 +263,137 @@ ExecAppend(PlanState *pstate)
 	}
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+	AppendState *node = castNode(AppendState, pstate);
+	Bitmapset *needrequest;
+	int	i;
+
+	Assert(node->as_nasyncplans > 0);
+
+	if (node->as_nasyncresult > 0)
+	{
+		--node->as_nasyncresult;
+		return node->as_asyncresult[node->as_nasyncresult];
+	}
+
+	needrequest = node->as_needrequest;
+	node->as_needrequest = NULL;
+	while ((i = bms_first_member(needrequest)) >= 0)
+	{
+		TupleTableSlot *slot;
+		PlanState *subnode = node->appendplans[i];
+
+		slot = ExecProcNode(subnode);
+		if (subnode->asyncstate == AS_AVAILABLE)
+		{
+			if (!TupIsNull(slot))
+			{
+				node->as_asyncresult[node->as_nasyncresult++] = slot;
+				node->as_needrequest = bms_add_member(node->as_needrequest, i);
+			}
+		}
+		else
+			node->as_pending_async = bms_add_member(node->as_pending_async, i);
+	}
+	bms_free(needrequest);
+
+	for (;;)
+	{
+		TupleTableSlot *result;
+
+		/* return now if a result is available */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		while (!bms_is_empty(node->as_pending_async))
+		{
+			long timeout = node->as_syncdone ? -1 : 0;
+			Bitmapset *fired;
+			int i;
+
+			fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async,
+									   timeout);
+			while ((i = bms_first_member(fired)) >= 0)
+			{
+				TupleTableSlot *slot;
+				PlanState *subnode = node->appendplans[i];
+				slot = ExecProcNode(subnode);
+				if (subnode->asyncstate == AS_AVAILABLE)
+				{
+					if (!TupIsNull(slot))
+					{
+						node->as_asyncresult[node->as_nasyncresult++] = slot;
+						node->as_needrequest =
+							bms_add_member(node->as_needrequest, i);
+					}
+					node->as_pending_async =
+						bms_del_member(node->as_pending_async, i);
+				}
+			}
+			bms_free(fired);
+
+			/* return now if a result is available */
+			if (node->as_nasyncresult > 0)
+			{
+				--node->as_nasyncresult;
+				return node->as_asyncresult[node->as_nasyncresult];
+			}
+
+			if (!node->as_syncdone)
+				break;
+		}
+
+		/*
+		 * If there is no asynchronous activity still pending and the
+		 * synchronous activity is also complete, we're totally done scanning
+		 * this node.  Otherwise, we're done with the asynchronous stuff but
+		 * must continue scanning the synchronous children.
+		 */
+		if (node->as_syncdone)
+		{
+			Assert(bms_is_empty(node->as_pending_async));
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		}
+
+		/*
+		 * get a tuple from the subplan
+		 */
+		result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+		if (!TupIsNull(result))
+		{
+			/*
+			 * If the subplan gave us something then return it as-is. We do
+			 * NOT make use of the result slot that was set up in
+			 * ExecInitAppend; there's no need for it.
+			 */
+			return result;
+		}
+
+		/*
+		 * Go on to the "next" subplan in the appropriate direction. If no
+		 * more subplans, return the empty slot set up for us by
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
+		 */
+		if (!node->choose_next_subplan(node))
+		{
+			node->as_syncdone = true;
+			if (bms_is_empty(node->as_pending_async))
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/* Else loop back and try to get a tuple from the new subplan */
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecEndAppend
  *
@@ -257,6 +427,15 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+	{
+		ExecShutdownNode(node->appendplans[i]);
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	}
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -276,7 +455,7 @@ ExecReScanAppend(AppendState *node)
 			ExecReScan(subnode);
 	}
 
-	node->as_whichplan =
+	node->as_whichsyncplan =
 		node->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
 }
 
@@ -365,7 +544,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-	int			whichplan = node->as_whichplan;
+	int			whichplan = node->as_whichsyncplan;
 
 	/* We should never see INVALID_SUBPLAN_INDEX in this case. */
 	Assert(whichplan >= 0 && whichplan <= node->as_nplans);
@@ -374,13 +553,13 @@ choose_next_subplan_locally(AppendState *node)
 	{
 		if (whichplan >= node->as_nplans - 1)
 			return false;
-		node->as_whichplan++;
+		node->as_whichsyncplan++;
 	}
 	else
 	{
 		if (whichplan <= 0)
 			return false;
-		node->as_whichplan--;
+		node->as_whichsyncplan--;
 	}
 
 	return true;
@@ -405,33 +584,33 @@ choose_next_subplan_for_leader(AppendState *node)
 
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
 	{
 		/* Mark just-completed subplan as finished. */
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 	}
 	else
 	{
 		/* Start with last subplan. */
-		node->as_whichplan = node->as_nplans - 1;
+		node->as_whichsyncplan = node->as_nplans - 1;
 	}
 
 	/* Loop until we find a subplan to execute. */
-	while (pstate->pa_finished[node->as_whichplan])
+	while (pstate->pa_finished[node->as_whichsyncplan])
 	{
-		if (node->as_whichplan == 0)
+		if (node->as_whichsyncplan == 0)
 		{
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-			node->as_whichplan = INVALID_SUBPLAN_INDEX;
+			node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 			LWLockRelease(&pstate->pa_lock);
 			return false;
 		}
-		node->as_whichplan--;
+		node->as_whichsyncplan--;
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < append->first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < append->first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
@@ -463,8 +642,8 @@ choose_next_subplan_for_worker(AppendState *node)
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
 	/* Mark just-completed subplan as finished. */
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	/* If all the plans are already done, we have nothing to do */
 	if (pstate->pa_next_plan == INVALID_SUBPLAN_INDEX)
@@ -489,10 +668,10 @@ choose_next_subplan_for_worker(AppendState *node)
 		else
 		{
 			/* At last plan, no partial plans, arrange to bail out. */
-			pstate->pa_next_plan = node->as_whichplan;
+			pstate->pa_next_plan = node->as_whichsyncplan;
 		}
 
-		if (pstate->pa_next_plan == node->as_whichplan)
+		if (pstate->pa_next_plan == node->as_whichsyncplan)
 		{
 			/* We've tried everything! */
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -502,7 +681,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* Pick the plan we found, and advance pa_next_plan one more time. */
-	node->as_whichplan = pstate->pa_next_plan++;
+	node->as_whichsyncplan = pstate->pa_next_plan++;
 	if (pstate->pa_next_plan >= node->as_nplans)
 	{
 		if (append->first_partial_plan < node->as_nplans)
@@ -518,8 +697,8 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < append->first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < append->first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 59865f5..9cb5470 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
 					(ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
 	scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+	scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+	if ((eflags & EXEC_FLAG_ASYNC) != 0)
+		scanstate->fs_async = true;
 
 	/*
 	 * Miscellaneous initialization
@@ -389,3 +392,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecForeignAsyncConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+							  void *caller_data, bool reinit)
+{
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+												 caller_data, reinit);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e599283..d85cb9c 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -204,7 +204,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels);
+						   int nasyncplans,	int referent,
+						   List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -284,6 +285,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1014,8 +1016,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1050,7 +1056,21 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/*
+		 * Classify as async-capable or not. If we have decided to run the
+		 * children in parallel, we cannot run any one of them asynchronously.
+		 */
+		if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1060,8 +1080,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, best_path->partitioned_rels);
+	plan = make_append(list_concat(asyncplans, syncplans),
+					   best_path->first_partial_path, nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5307,8 +5329,8 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+			int referent, List *tlist, List *partitioned_rels)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5320,6 +5342,8 @@ make_append(List *appendplans, int first_partial_plan,
 	node->partitioned_rels = partitioned_rels;
 	node->appendplans = appendplans;
 	node->first_partial_plan = first_partial_plan;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6656,3 +6680,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d130114..667878b 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3673,6 +3673,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..5fd67d9
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.c
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/backend/executor/execAsync.c
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+								   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+									 long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 6545a80..60f4e51 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS		0x0020	/* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS	0x0040	/* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA	0x0080	/* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC			0x0100	/* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index ccb66be..67abf8e 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+										  WaitEventSet *wes,
+										  void *caller_data, bool reinit);
 
 #endif							/* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e88fee3..beb3f0d 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
 															List *fdw_private,
 															RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+													WaitEventSet *wes,
+													void *caller_data,
+													bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -182,6 +187,7 @@ typedef struct FdwRoutine
 	GetForeignPlan_function GetForeignPlan;
 	BeginForeignScan_function BeginForeignScan;
 	IterateForeignScan_function IterateForeignScan;
+	IterateForeignScan_function IterateForeignScanAsync;
 	ReScanForeignScan_function ReScanForeignScan;
 	EndForeignScan_function EndForeignScan;
 
@@ -232,6 +238,11 @@ typedef struct FdwRoutine
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 
 	/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4bb5cb1..405ad7b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -851,6 +851,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+	AS_AVAILABLE,
+	AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
 	NodeTag		type;
@@ -891,6 +897,9 @@ typedef struct PlanState
 	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
 	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
+
+	AsyncState	asyncstate;
+	int32		padding;			/* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1013,10 +1022,16 @@ struct AppendState
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
-	int			as_whichplan;
+	int			as_nasyncplans;	/* # of async-capable children */
 	ParallelAppendState *as_pstate; /* parallel coordination info */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
 	Size		pstate_len;		/* size of parallel coordination info */
 	bool		(*choose_next_subplan) (AppendState *);
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	Bitmapset  *as_pending_async;	/* pending async plans */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1567,6 +1582,7 @@ typedef struct ForeignScanState
 	Size		pscan_len;		/* size of parallel coordination information */
 	/* use struct pointer to avoid including fdwapi.h here */
 	struct FdwRoutine *fdwroutine;
+	bool		fs_async;
 	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 74e9fb5..b4535f0 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -249,6 +249,8 @@ typedef struct Append
 	List	   *partitioned_rels;
 	List	   *appendplans;
 	int			first_partial_plan;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 3d3c0b6..a1ba26f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -831,7 +831,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2
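
Before the postgres_fdw patch itself, a minimal sketch of what the two new
FDW callbacks look like from an FDW author's point of view. The myfdw_*
names and the MyFdwScanState fields are invented for this sketch (the real
postgres_fdw versions follow in patch 0003); the callback signatures and
the role of caller_data, which ExecAsyncEventWait() hands back through
WaitEvent.user_data, are those of patch 0002.

/* Invented per-scan state for this sketch; a real FDW keeps its own. */
typedef struct MyFdwScanState
{
	pgsocket	sock;			/* socket of the in-flight remote query */
	bool		query_in_flight;
} MyFdwScanState;

static bool
myfdw_IsForeignPathAsyncCapable(ForeignPath *path)
{
	/* Claim async support unconditionally; a real FDW may be pickier. */
	return true;
}

static bool
myfdw_ForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
								void *caller_data, bool reinit)
{
	MyFdwScanState *st = (MyFdwScanState *) node->fdw_state;

	/* No query in flight: nothing to wait for on this node. */
	if (!st->query_in_flight)
		return false;

	/*
	 * caller_data identifies this node to ExecAsyncEventWait(); it comes
	 * back via WaitEvent.user_data when the socket becomes readable.
	 * (reinit is unused in this sketch.)
	 */
	AddWaitEventToSet(wes, WL_SOCKET_READABLE, st->sock, NULL, caller_data);
	return true;
}

ExecAppendAsync() then drives such nodes: each member of as_needrequest
gets an ExecProcNode() call, any node that reports AS_WAITING is parked in
as_pending_async, and ExecAsyncEventWait() tells Append which of those have
become readable and should be asked for a tuple again.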

0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From 6612fbe0cab492fedead1d35f1b9cdf24f3e6dd4 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 484 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 522 insertions(+), 138 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 00c926b..4f3d59d 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
 	bool		invalidated;	/* true if reconnect is pending */
 	uint32		server_hashvalue;	/* hash value of foreign server OID */
 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
+		entry->storage = NULL;
 	}
 
 	/*
@@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user. Allocate it
+ * with initsize bytes if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	bool		found;
+	ConnCacheEntry *entry;
+	ConnCacheKey key;
+
+	key = user->umid;
+	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+	Assert(found);
+
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 683d641..3b4eefa 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6514,7 +6514,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6542,7 +6542,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6570,7 +6570,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6598,7 +6598,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6664,35 +6664,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6702,35 +6707,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -6760,11 +6770,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -6778,11 +6788,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -6813,16 +6823,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -6840,16 +6850,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -7000,27 +7010,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 7992ba5..5ea1d88 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+								 * list. Maintained only by the current
+								 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -288,6 +319,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -348,6 +380,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+											  WaitEventSet *wes,
+											  void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -368,7 +404,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -438,6 +477,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -472,6 +512,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1322,12 +1366,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1383,32 +1436,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (node->fs_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					node->ss.ps.asyncstate = AS_WAITING;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If any other node is waiting on this connection, let the
+			 * first waiter be the next owner of the connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself
+				 * when no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Some other node is holding this connection and this node must
+			 * run later.  Add this node to the tail of the waiters' list,
+			 * then return not-ready.  To avoid scanning through the waiters'
+			 * list, the current owner maintains a shortcut to the last
+			 * waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			node->ss.ps.asyncstate =
+				fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
+			return ExecClearTuple(slot);
+		}
+
+		/* At this point no node should own the connection */
+		Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+			   == NULL);
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
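+			/*
+			 * If the next owner is not executing asynchronously, block here
+			 * and receive the result immediately.
+			 */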
+			if (!next_conn_owner->fs_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
+			node->ss.ps.asyncstate =
+				fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
+	node->ss.ps.asyncstate = AS_AVAILABLE;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1424,7 +1581,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1432,6 +1589,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1460,9 +1620,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1480,7 +1640,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1488,16 +1648,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Clean up asynchronous state and absorb any remaining result on the
+ *		connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1700,7 +1876,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1779,6 +1957,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1789,14 +1969,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1804,10 +1984,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1845,6 +2025,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1865,14 +2047,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1880,10 +2062,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1921,6 +2103,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1941,14 +2125,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1956,10 +2140,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2006,16 +2190,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2303,7 +2487,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2356,7 +2542,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2403,8 +2592,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2523,6 +2712,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2565,6 +2755,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2919,11 +3119,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2989,47 +3189,96 @@ create_cursor(ForeignScanState *node)
- * Fetch some more rows from the node's cursor.
+ * Send a request to fetch more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the result of the query previously sent on the node's connection
+ * and store the returned rows.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while (fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3039,27 +3288,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
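+		/* On error, make sure the connection is no longer marked as owned */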
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb and discard the result of the query currently running on the
+ * node's connection.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while (PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3143,7 +3447,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3153,12 +3457,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3166,9 +3470,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3299,9 +3603,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3309,10 +3613,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4582,6 +4886,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
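+/*
+ * postgresIsForeignPathAsyncCapable
+ *		Report whether the given path can be executed asynchronously.
+ *		(This PoC reports every postgres_fdw path as async-capable.)
+ */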
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when this node is the connection owner.  Otherwise
+ * another node on this connection owns it and adds the event instead.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+								  void *caller_data, bool reinit)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(wes,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, caller_data);
+		return true;
+	}
+
+	return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
@@ -4946,7 +5286,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 1ae809d..58ef26e 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 3c3c5c7..cb9caa5 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1535,25 +1535,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1589,12 +1589,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1653,8 +1653,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.9.2

#63Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#62)
3 attachment(s)
Re: [HACKERS] asynchronous execution

At Thu, 11 Jan 2018 17:08:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180111.170839.23674040.horiguchi.kyotaro@lab.ntt.co.jp>

At Mon, 11 Dec 2017 20:07:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20171211.200753.191768178.horiguchi.kyotaro@lab.ntt.co.jp>

The attached PoC patch theoretically has no impact on the normal
code paths and just brings gain in async cases.

The just-committed parallel append patch conflicted with this, and
the attached is the version rebased onto current HEAD. The results
of a brief performance test follow.

                           patched(ms)  unpatched(ms)  gain(%)
A: simple table scan     :     3562.32        3444.81     -3.4
B: local partitioning    :     1451.25        1604.38      9.5
C: single remote table   :     8818.92        9297.76      5.1
D: sharding (single con) :     5966.14        6646.73     10.2
E: sharding (multi con)  :     1802.25        6515.49     72.3

A and B are degradation checks, which are expected to show no
slowdown. C shows the gain from postgres_fdw's command presending
alone on a single remote table. D shows the gain from sharding over a
single connection. The number of partitions/shards is 4. E shows the
gain from using a dedicated connection per shard.

Test A is accelerated by parallel sequential scan, and test B by the
newly-introduced parallel append. Comparing A and B, I doubt that the
degradation is stably measurable, at least in my environment, but I
believe there is no degradation theoretically. Tests C to E still
show a clear gain.
regards,

The patch conflicts with 3cac0ec. This is the rebased version.

The patch itself had not been workable for quite a long time.

- Rebased to current master.
  (Removed some wrongly-inserted lines)
- Fixed a wrongly-positioned assertion in postgres_fdw.c.
  (It caused an assertion failure in normal usage)
- Properly reset a persistent (static) variable.
  (It caused a SEGV under certain conditions)
- Fixed the EXPLAIN output of an async-mixed append plan.
  (It now chooses the proper subnode as the referent node)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From 6ab58d3fb02f716deaa207824747646dd8c2a448 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify one.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 68 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 97 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4f6d4d..890972b 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index e6706f7..5457899 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index ef1d5ba..30edc8e 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	{
 		WaitEventSet *new_event_set;
 
-		new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+		new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
 		AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  MyLatch, NULL);
 		AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index e09a4f1..7ae8777 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -124,6 +124,7 @@ typedef struct ResourceOwnerData
 	ResourceArray snapshotarr;	/* snapshot references */
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -169,6 +170,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -437,6 +439,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->snapshotarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -538,6 +541,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 				PrintDSMLeakWarning(res);
 			dsm_detach(res);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -685,6 +698,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->snapshotarr.nitems == 0);
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -711,6 +725,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->snapshotarr));
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1253,3 +1268,56 @@ PrintDSMLeakWarning(dsm_segment *seg)
 	elog(WARNING, "dynamic shared memory leak: segment %u still referenced",
 		 dsm_segment_handle(seg));
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXX: There's no property to show as an identifier of a wait event
+	 * set, so use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a4bcb48..838845a 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 22b377c..56f2059 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -88,4 +89,11 @@ extern void ResourceOwnerRememberDSM(ResourceOwner owner,
 extern void ResourceOwnerForgetDSM(ResourceOwner owner,
 					   dsm_segment *);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif							/* RESOWNER_PRIVATE_H */
-- 
2.9.2
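
(For reference, a minimal sketch -- not part of the patch -- of how a
caller might use the extended CreateWaitEventSet() API above; "sock" is a
hypothetical socket to watch:

	WaitEvent	ev;
	WaitEventSet *wes;

	/* Tie the set's lifetime to the current resource owner */
	wes = CreateWaitEventSet(CurrentMemoryContext, CurrentResourceOwner, 2);
	AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
	AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);
	(void) WaitEventSetWait(wes, -1, &ev, 1, WAIT_EVENT_CLIENT_READ);

	/*
	 * If an error is thrown anywhere above, resource owner cleanup frees
	 * the set; on the normal path it is freed explicitly.
	 */
	FreeWaitEventSet(wes);

With this, a WaitEventSet no longer leaks when the backend errors out
while waiting, which is the case the async-execution patch needs.)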

0002-core-side-modification.patchtext/x-patch; charset=us-asciiDownload
From 60c663b3059e10302a71023eccb275da51331b39 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 145 ++++++++++++++++++++
 src/backend/executor/nodeAppend.c       | 228 ++++++++++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  22 ++-
 src/backend/optimizer/plan/createplan.c |  62 ++++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/include/executor/execAsync.h        |  23 ++++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/execnodes.h           |  18 ++-
 src/include/nodes/plannodes.h           |   2 +
 src/include/pgstat.h                    |   3 +-
 14 files changed, 489 insertions(+), 42 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895..8ad2adf 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..db477e2
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,145 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
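+/*
+ * ExecAsyncSetState
+ *		Set the asynchronous state of the given PlanState node.
+ */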
+void
+ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+	pstate->asyncstate = status;
+}
+
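+/*
+ * ExecAsyncConfigureWait
+ *		Add a wait event for the given node to the wait event set,
+ *		dispatching on the node type.
+ */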
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+					   void *data, bool reinit)
+{
+	switch (nodeTag(node))
+	{
+		case T_ForeignScanState:
+			return ExecForeignAsyncConfigureWait((ForeignScanState *) node,
+												 wes, data, reinit);
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				 (int) nodeTag(node));
+	}
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+	int **p_refind;
+	int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void
+ExecAsyncMemoryContextCallback(void *arg)
+{
+	/* arg is the address of the variable refind in ExecAsyncEventWait */
+	ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+	*mcbarg->p_refind = NULL;
+	*mcbarg->p_refindsize = 0;
+}
+
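+/* Maximum number of wait events to process per WaitEventSetWait() call */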
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+	static int *refind = NULL;
+	static int refindsize = 0;
+	WaitEventSet *wes;
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int noccurred = 0;
+	Bitmapset *fired_events = NULL;
+	int i;
+	int n;
+
+	n = bms_num_members(waitnodes);
+	wes = CreateWaitEventSet(TopTransactionContext,
+							 TopTransactionResourceOwner, n);
+	if (refindsize < n)
+	{
+		if (refindsize == 0)
+			refindsize = EVENT_BUFFER_SIZE; /* XXX */
+		while (refindsize < n)
+			refindsize *= 2;
+		if (refind)
+			refind = (int *) repalloc(refind, refindsize * sizeof(int));
+		else
+		{
+			static ExecAsync_mcbarg mcb_arg =
+				{ &refind, &refindsize };
+			static MemoryContextCallback mcb =
+				{ ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+			MemoryContext oldctxt =
+				MemoryContextSwitchTo(TopTransactionContext);
+
+			/*
+			 * refind points to a memory block in
+			 * TopTransactionContext. Register a callback to reset it.
+			 */
+			MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+			refind = (int *) palloc(refindsize * sizeof(int));
+			MemoryContextSwitchTo(oldctxt);
+		}
+	}
+
+	n = 0;
+	for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+		 i = bms_next_member(waitnodes, i))
+	{
+		refind[i] = i;
+		if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+			n++;
+	}
+
+	if (n == 0)
+	{
+		FreeWaitEventSet(wes);
+		return NULL;
+	}
+
+	noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+								 EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+	FreeWaitEventSet(wes);
+	if (noccurred == 0)
+		return NULL;
+
+	for (i = 0 ; i < noccurred ; i++)
+	{
+		WaitEvent *w = &occurred_event[i];
+
+		if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		{
+			int n = *(int*)w->user_data;
+
+			fired_events = bms_add_member(fired_events, n);
+		}
+	}
+
+	return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 7a3dd2e..df1f7ae 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -59,6 +59,7 @@
 
 #include "executor/execdebug.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -79,6 +80,7 @@ struct ParallelAppendState
 #define INVALID_SUBPLAN_INDEX		-1
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,7 +106,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	ListCell   *lc;
 
 	/* check for unsupported flags */
-	Assert(!(eflags & EXEC_FLAG_MARK));
+	Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
 	/*
 	 * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -127,6 +129,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->ps.ExecProcNode = ExecAppend;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
+	appendstate->as_nasyncplans = node->nasyncplans;
+	appendstate->as_syncdone = (node->nasyncplans == nplans);
+	appendstate->as_asyncresult = (TupleTableSlot **)
+		palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+	/* Choose async version of Exec function */
+	if (appendstate->as_nasyncplans > 0)
+		appendstate->ps.ExecProcNode = ExecAppendAsync;
+
+	/* initially, all async subplans need a request */
+	for (i = 0; i < appendstate->as_nasyncplans; ++i)
+		appendstate->as_needrequest =
+			bms_add_member(appendstate->as_needrequest, i);
 
 	/*
 	 * Initialize result tuple type and slot.
@@ -141,11 +156,19 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	foreach(lc, node->appendplans)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
+		int			sub_eflags = eflags;
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		if (i < appendstate->as_nasyncplans)
+			sub_eflags |= EXEC_FLAG_ASYNC;
+
+		appendplanstates[i] = ExecInitNode(initNode, estate, sub_eflags);
 		i++;
 	}
 
 	/*
 	 * Miscellaneous initialization
 	 *
@@ -159,8 +182,12 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	 * looking at shared memory, but non-parallel-aware append plans can
 	 * always start with the first subplan.
 	 */
-	appendstate->as_whichplan =
-		appendstate->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
+	if (appendstate->ps.plan->parallel_aware)
+		appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
+	else if (appendstate->as_nasyncplans > 0)
+		appendstate->as_whichsyncplan = appendstate->as_nasyncplans;
+	else
+		appendstate->as_whichsyncplan = 0;
 
 	/* If parallel-aware, this will be overridden later. */
 	appendstate->choose_next_subplan = choose_next_subplan_locally;
@@ -180,10 +207,12 @@ ExecAppend(PlanState *pstate)
 	AppendState *node = castNode(AppendState, pstate);
 
 	/* If no subplan has been chosen, we must choose one before proceeding. */
-	if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+	if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
 		!node->choose_next_subplan(node))
 		return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
+	Assert(node->as_nasyncplans == 0);
+
 	for (;;)
 	{
 		PlanState  *subnode;
@@ -194,8 +223,9 @@ ExecAppend(PlanState *pstate)
 		/*
 		 * figure out which subplan we are currently processing
 		 */
-		Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(node->as_whichsyncplan >= 0 &&
+			   node->as_whichsyncplan < node->as_nplans);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -218,6 +248,137 @@ ExecAppend(PlanState *pstate)
 	}
 }
 
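+/* ----------------------------------------------------------------
+ *		ExecAppendAsync
+ *
+ *		Handles a mix of asynchronous and synchronous subplans: first
+ *		sends new requests to async subplans that need one, then drains
+ *		async results as they arrive, falling back to the synchronous
+ *		subplans while async results are still pending.
+ * ----------------------------------------------------------------
+ */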
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+	AppendState *node = castNode(AppendState, pstate);
+	Bitmapset *needrequest;
+	int	i;
+
+	Assert(node->as_nasyncplans > 0);
+
+	if (node->as_nasyncresult > 0)
+	{
+		--node->as_nasyncresult;
+		return node->as_asyncresult[node->as_nasyncresult];
+	}
+
+	needrequest = node->as_needrequest;
+	node->as_needrequest = NULL;
+	while ((i = bms_first_member(needrequest)) >= 0)
+	{
+		TupleTableSlot *slot;
+		PlanState *subnode = node->appendplans[i];
+
+		slot = ExecProcNode(subnode);
+		if (subnode->asyncstate == AS_AVAILABLE)
+		{
+			if (!TupIsNull(slot))
+			{
+				node->as_asyncresult[node->as_nasyncresult++] = slot;
+				node->as_needrequest = bms_add_member(node->as_needrequest, i);
+			}
+		}
+		else
+			node->as_pending_async = bms_add_member(node->as_pending_async, i);
+	}
+	bms_free(needrequest);
+
+	for (;;)
+	{
+		TupleTableSlot *result;
+
+		/* return now if a result is available */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		while (!bms_is_empty(node->as_pending_async))
+		{
+			long timeout = node->as_syncdone ? -1 : 0;
+			Bitmapset *fired;
+			int i;
+
+			fired = ExecAsyncEventWait(node->appendplans, node->as_pending_async,
+									   timeout);
+			while ((i = bms_first_member(fired)) >= 0)
+			{
+				TupleTableSlot *slot;
+				PlanState *subnode = node->appendplans[i];
+				slot = ExecProcNode(subnode);
+				if (subnode->asyncstate == AS_AVAILABLE)
+				{
+					if (!TupIsNull(slot))
+					{
+						node->as_asyncresult[node->as_nasyncresult++] = slot;
+						node->as_needrequest =
+							bms_add_member(node->as_needrequest, i);
+					}
+					node->as_pending_async =
+						bms_del_member(node->as_pending_async, i);
+				}
+			}
+			bms_free(fired);
+
+			/* return now if a result is available */
+			if (node->as_nasyncresult > 0)
+			{
+				--node->as_nasyncresult;
+				return node->as_asyncresult[node->as_nasyncresult];
+			}
+
+			if (!node->as_syncdone)
+				break;
+		}
+
+		/*
+		 * If there is no asynchronous activity still pending and the
+		 * synchronous activity is also complete, we're totally done scanning
+		 * this node.  Otherwise, we're done with the asynchronous stuff but
+		 * must continue scanning the synchronous children.
+		 */
+		if (node->as_syncdone)
+		{
+			Assert(bms_is_empty(node->as_pending_async));
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		}
+
+		/*
+		 * get a tuple from the subplan
+		 */
+		result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+		if (!TupIsNull(result))
+		{
+			/*
+			 * If the subplan gave us something then return it as-is. We do
+			 * NOT make use of the result slot that was set up in
+			 * ExecInitAppend; there's no need for it.
+			 */
+			return result;
+		}
+
+		/*
+		 * Go on to the "next" subplan in the appropriate direction. If no
+		 * more subplans, return the empty slot set up for us by
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
+		 */
+		if (!node->choose_next_subplan(node))
+		{
+			node->as_syncdone = true;
+			if (bms_is_empty(node->as_pending_async))
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/* Else loop back and try to get a tuple from the new subplan */
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecEndAppend
  *
@@ -251,6 +412,15 @@ ExecReScanAppend(AppendState *node)
 {
 	int			i;
 
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+	{
+		ExecShutdownNode(node->appendplans[i]);
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	}
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -270,7 +440,7 @@ ExecReScanAppend(AppendState *node)
 			ExecReScan(subnode);
 	}
 
-	node->as_whichplan =
+	node->as_whichsyncplan =
 		node->ps.plan->parallel_aware ? INVALID_SUBPLAN_INDEX : 0;
 }
 
@@ -359,7 +529,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-	int			whichplan = node->as_whichplan;
+	int			whichplan = node->as_whichsyncplan;
 
 	/* We should never see INVALID_SUBPLAN_INDEX in this case. */
 	Assert(whichplan >= 0 && whichplan <= node->as_nplans);
@@ -368,13 +538,13 @@ choose_next_subplan_locally(AppendState *node)
 	{
 		if (whichplan >= node->as_nplans - 1)
 			return false;
-		node->as_whichplan++;
+		node->as_whichsyncplan++;
 	}
 	else
 	{
 		if (whichplan <= 0)
 			return false;
-		node->as_whichplan--;
+		node->as_whichsyncplan--;
 	}
 
 	return true;
@@ -399,33 +569,33 @@ choose_next_subplan_for_leader(AppendState *node)
 
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
 	{
 		/* Mark just-completed subplan as finished. */
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 	}
 	else
 	{
 		/* Start with last subplan. */
-		node->as_whichplan = node->as_nplans - 1;
+		node->as_whichsyncplan = node->as_nplans - 1;
 	}
 
 	/* Loop until we find a subplan to execute. */
-	while (pstate->pa_finished[node->as_whichplan])
+	while (pstate->pa_finished[node->as_whichsyncplan])
 	{
-		if (node->as_whichplan == 0)
+		if (node->as_whichsyncplan == 0)
 		{
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-			node->as_whichplan = INVALID_SUBPLAN_INDEX;
+			node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 			LWLockRelease(&pstate->pa_lock);
 			return false;
 		}
-		node->as_whichplan--;
+		node->as_whichsyncplan--;
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < append->first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < append->first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
@@ -457,8 +627,8 @@ choose_next_subplan_for_worker(AppendState *node)
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
 	/* Mark just-completed subplan as finished. */
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	/* If all the plans are already done, we have nothing to do */
 	if (pstate->pa_next_plan == INVALID_SUBPLAN_INDEX)
@@ -468,7 +638,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* Save the plan from which we are starting the search. */
-	node->as_whichplan = pstate->pa_next_plan;
+	node->as_whichsyncplan = pstate->pa_next_plan;
 
 	/* Loop until we find a subplan to execute. */
 	while (pstate->pa_finished[pstate->pa_next_plan])
@@ -478,7 +648,7 @@ choose_next_subplan_for_worker(AppendState *node)
 			/* Advance to next plan. */
 			pstate->pa_next_plan++;
 		}
-		else if (node->as_whichplan > append->first_partial_plan)
+		else if (node->as_whichsyncplan > append->first_partial_plan)
 		{
 			/* Loop back to first partial plan. */
 			pstate->pa_next_plan = append->first_partial_plan;
@@ -489,10 +659,10 @@ choose_next_subplan_for_worker(AppendState *node)
 			 * At last plan, and either there are no partial plans or we've
 			 * tried them all.  Arrange to bail out.
 			 */
-			pstate->pa_next_plan = node->as_whichplan;
+			pstate->pa_next_plan = node->as_whichsyncplan;
 		}
 
-		if (pstate->pa_next_plan == node->as_whichplan)
+		if (pstate->pa_next_plan == node->as_whichsyncplan)
 		{
 			/* We've tried everything! */
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -502,7 +672,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* Pick the plan we found, and advance pa_next_plan one more time. */
-	node->as_whichplan = pstate->pa_next_plan++;
+	node->as_whichsyncplan = pstate->pa_next_plan++;
 	if (pstate->pa_next_plan >= node->as_nplans)
 	{
 		if (append->first_partial_plan < node->as_nplans)
@@ -518,8 +688,8 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < append->first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < append->first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 0084234..7da1ac5 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
 					(ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
 	scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+	scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
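+	/* The parent requested asynchronous execution of this scan. */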
+	if ((eflags & EXEC_FLAG_ASYNC) != 0)
+		scanstate->fs_async = true;
 
 	/*
 	 * Miscellaneous initialization
@@ -383,3 +386,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecForeignAsyncConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+							  void *caller_data, bool reinit)
+{
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+												 caller_data, reinit);
+}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index da0cc7f..24c838d 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -204,7 +204,8 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels);
+						   int nasyncplans, int referent,
+						   List *tlist, List *partitioned_rels);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -287,6 +288,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1020,8 +1022,12 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1056,7 +1062,21 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/*
+		 * Classify as async-capable or not. If we have decided to run the
+		 * children in parallel, none of them may run asynchronously.
+		 */
+		if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+		{
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	/*
@@ -1066,8 +1086,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, best_path->partitioned_rels);
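+	/*
+	 * Async children are placed at the front of the list, so if the first
+	 * child was synchronous it now lives at index nasyncplans; that index
+	 * is recorded as the Append's referent for Var deparsing.
+	 */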
+	plan = make_append(list_concat(asyncplans, syncplans),
+					   best_path->first_partial_path, nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5319,8 +5341,8 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+			int referent, List *tlist, List *partitioned_rels)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5332,6 +5354,8 @@ make_append(List *appendplans, int first_partial_plan,
 	node->partitioned_rels = partitioned_rels;
 	node->appendplans = appendplans;
 	node->first_partial_plan = first_partial_plan;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
 
 	return node;
 }
@@ -6677,3 +6701,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+			break;
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 96ba216..08eac23 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3676,6 +3676,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index ba9fab4..6837642 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4463,7 +4463,7 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	dpns->planstate = ps;
 
 	/*
-	 * We special-case Append and MergeAppend to pretend that the first child
+	 * We special-case Append and MergeAppend to pretend that a specific child
 	 * plan is the OUTER referent; we have to interpret OUTER Vars in their
 	 * tlists according to one of the children, and the first one is the most
 	 * natural choice.  Likewise special-case ModifyTable to pretend that the
@@ -4471,7 +4471,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
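+		/* Async children are reordered, so the referent need not be first. */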
+		AppendState *aps = (AppendState *) ps;
+		Append *app = (Append *) ps->plan;
+		dpns->outer_planstate = aps->appendplans[app->referent];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..5fd67d9
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+								   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+									 long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 45a077a..54cc358 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -64,6 +64,7 @@
 #define EXEC_FLAG_WITH_OIDS		0x0020	/* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS	0x0040	/* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA	0x0080	/* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC			0x0100	/* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index ccb66be..67abf8e 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+										  WaitEventSet *wes,
+										  void *caller_data, bool reinit);
 
 #endif							/* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e88fee3..beb3f0d 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -161,6 +161,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
 															List *fdw_private,
 															RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+													WaitEventSet *wes,
+													void *caller_data,
+													bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -182,6 +187,7 @@ typedef struct FdwRoutine
 	GetForeignPlan_function GetForeignPlan;
 	BeginForeignScan_function BeginForeignScan;
 	IterateForeignScan_function IterateForeignScan;
+	IterateForeignScan_function IterateForeignScanAsync;
 	ReScanForeignScan_function ReScanForeignScan;
 	EndForeignScan_function EndForeignScan;
 
@@ -232,6 +238,11 @@ typedef struct FdwRoutine
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 
 	/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a953820..c9c3db2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -861,6 +861,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+	AS_AVAILABLE,				/* a result is available, or no request is in flight */
+	AS_WAITING					/* waiting for an asynchronous result */
+} AsyncState;
+
 typedef struct PlanState
 {
 	NodeTag		type;
@@ -901,6 +907,9 @@ typedef struct PlanState
 	TupleTableSlot *ps_ResultTupleSlot; /* slot for my result tuples */
 	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
+
+	AsyncState	asyncstate;
+	int32		padding;			/* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1023,10 +1032,16 @@ struct AppendState
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
-	int			as_whichplan;
+	int			as_nasyncplans;	/* # of async-capable children */
 	ParallelAppendState *as_pstate; /* parallel coordination info */
+	int			as_whichsyncplan; /* which sync plan is being executed  */
 	Size		pstate_len;		/* size of parallel coordination info */
 	bool		(*choose_next_subplan) (AppendState *);
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	Bitmapset  *as_pending_async;	/* pending async plans */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1577,6 +1592,7 @@ typedef struct ForeignScanState
 	Size		pscan_len;		/* size of parallel coordination information */
 	/* use struct pointer to avoid including fdwapi.h here */
 	struct FdwRoutine *fdwroutine;
+	bool		fs_async;
 	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2e19ea..64ee18e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -250,6 +250,8 @@ typedef struct Append
 	List	   *partitioned_rels;
 	List	   *appendplans;
 	int			first_partial_plan;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f592..6f4583b 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.9.2

0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From c2195953a34fe7c0574631e5c118a948263dc755 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 128 ++++---
 contrib/postgres_fdw/postgres_fdw.c            | 484 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 522 insertions(+), 138 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 00c926b..4f3d59d 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
 	bool		invalidated;	/* true if reconnect is pending */
 	uint32		server_hashvalue;	/* hash value of foreign server OID */
 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
+		entry->storage = NULL;
 	}
 
 	/*
@@ -216,6 +218,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user, allocating it
+ * with initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	bool		found;
+	ConnCacheEntry *entry;
+	ConnCacheKey key;
+
+	key = user->umid;
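+	/* GetConnection() must have created the hash entry already. */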
+	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+	Assert(found);
+
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 262c635..29ba813 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6790,7 +6790,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6818,7 +6818,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6846,7 +6846,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6874,7 +6874,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6940,35 +6940,40 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6978,35 +6983,40 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(28 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7036,11 +7046,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7054,11 +7064,11 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                            ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
 (39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
@@ -7089,16 +7099,16 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
                ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7116,16 +7126,16 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
                      ->  Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
                      ->  Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
 (45 rows)
 
 update bar set f2 = f2 + 100
@@ -7276,27 +7286,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 941a2e7..337c728 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -120,10 +126,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,13 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* The last node in the waiters' list.
+									 * Maintained only by the current owner
+									 * of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +193,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +220,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -293,6 +324,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -353,6 +385,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 UpperRelationKind stage,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+											  WaitEventSet *wes,
+											  void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -373,7 +409,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -452,6 +491,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -486,6 +526,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1336,12 +1380,21 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->current_owner = NULL;
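+	/* No one waits for us yet, so last_waiter points at ourselves. */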
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1397,32 +1450,136 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
-		if (fsstate->next_tuple >= fsstate->num_tuples)
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connpriv->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
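+				/* Poll the socket; the zero timeout means we never block. */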
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (node->fs_async && !(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					fsstate->result_ready = false;
+					node->ss.ps.asyncstate = AS_WAITING;
+					return ExecClearTuple(slot);
+				}
+			}
+
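+			/*
+			 * Either the result is now ready, or this is a synchronous
+			 * caller for which fetch_received_data() below may block.
+			 */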
+			Assert(fsstate->async_waiting);
+			fsstate->async_waiting = false;
+			fetch_received_data(node);
+
+			/*
+			 * If other nodes are waiting behind this node on the same
+			 * connection, let the first waiter become the next owner of the
+			 * connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connpriv->current_owner &&
+				 !GetPgFdwScanState(node)->eof_reached)
+		{
+			/*
+			 * Anyone else is holding this connection and we want this node to
+			 * run later. Add myself to the tail of the waiters' list then
+			 * return not-ready.  To avoid scanning through the waiters' list,
+			 * the current owner is to maintain the shortcut to the last
+			 * waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connpriv->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			fsstate->result_ready = fsstate->eof_reached;
+			node->ss.ps.asyncstate =
+				fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
 			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			/* No one is running on this connection at this time */
+			Assert(GetPgFdwScanState(next_conn_owner)->s.connpriv->current_owner
+				   == NULL);
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+				next_owner_state->async_waiting = true;
+
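+			/* A synchronous caller cannot give way; read the result now. */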
+			if (!next_conn_owner->fs_async)
+				fetch_received_data(next_conn_owner);
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
+		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			fsstate->result_ready = fsstate->eof_reached;
+			node->ss.ps.asyncstate =
+				fsstate->result_ready ? AS_AVAILABLE : AS_WAITING;
+			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
+	node->ss.ps.asyncstate = AS_AVAILABLE;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1438,7 +1595,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1446,6 +1603,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1474,9 +1634,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1494,7 +1654,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1502,16 +1662,32 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove the asynchrony state and clean up any leftovers on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
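+	/* For non-SELECT, fdw_state is not a PgFdwScanState; nothing to do. */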
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1714,7 +1890,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1793,6 +1971,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1803,14 +1983,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1818,10 +1998,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1859,6 +2039,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1879,14 +2061,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1894,10 +2076,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1935,6 +2117,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1955,14 +2139,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1970,10 +2154,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2020,16 +2204,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2353,7 +2537,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Update the foreign-join-related fields. */
 	if (fsplan->scan.scanrelid == 0)
@@ -2438,7 +2624,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
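+		/* An async scan may still own this connection; let it finish. */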
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2485,8 +2674,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* close the target relation. */
 	if (dmstate->resultRel)
@@ -2609,6 +2798,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2651,6 +2841,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
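+		/* We are about to send a query, so finish any query in flight. */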
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+			vacate_connection(&tmpstate);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3005,11 +3205,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -3075,47 +3275,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connpriv->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->current_owner = node;
+}
+
+/*
+ * Receive the rows returned by a previously-sent FETCH request.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connpriv->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while (fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
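+		/* Grow the tuple array so retained and new rows both fit. */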
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3125,27 +3374,82 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
 /*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *owner;
+
+	if (connpriv == NULL || connpriv->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connpriv->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
+
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while (PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connpriv->current_owner = NULL;
+		fsstate->async_waiting = false;
+	}
+}
+
+/*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
  *
@@ -3229,7 +3533,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3239,12 +3543,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3252,9 +3556,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3515,9 +3819,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3525,10 +3829,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -5007,6 +5311,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+
+/*
+ * Configure waiting event.
+ *
+ * Add a wait event only when the node is the connection owner.  Otherwise
+ * another node on this connection is the owner.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+								  void *caller_data, bool reinit)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->current_owner == node)
+	{
+		AddWaitEventToSet(wes,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, caller_data);
+		return true;
+	}
+
+	return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index d37cc88..132367a 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 2863549..9ba8135 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1614,25 +1614,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1668,12 +1668,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1732,8 +1732,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.9.2

#64Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#63)
3 attachment(s)
Re: [HACKERS] asynchronous execution

Hello. This is the new version of $Subject.

But this is not just a rebased version. While fixing the serious
conflicts I also refactored the patch, and I believe it is now far
more readable than the previous version.

# 0003 currently lacks the changes to postgres_fdw.out.

- Waiting-queue manipulation has been moved into new functions. The
old code had a bug where the same node could be inserted into the
queue more than once; that is now fixed, as sketched below.
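
For illustration, a duplicate-safe enqueue can look like the sketch
below. The waiter/last_waiter/async_waiting fields are the ones
visible in vacate_connection() in the postgres_fdw part of the patch;
the function name and the use of async_waiting as the "already queued"
flag are my own rendering, not the patch text:

/*
 * Sketch only: append a node to the connection's waiting queue unless
 * it is already queued.  Assumes the running node is
 * connpriv->current_owner, each waiter's fsstate->waiter points to
 * the next waiter, and the owner's fsstate->last_waiter caches the
 * tail of the list (hypothetical layout).
 */
static void
add_async_waiter(ForeignScanState *node)
{
	PgFdwScanState *fsstate = GetPgFdwScanState(node);
	ForeignScanState *owner = fsstate->s.connpriv->current_owner;
	PgFdwScanState *owner_state;

	Assert(owner != NULL && owner != node);

	/* This check is what prevents the duplicate insertion. */
	if (fsstate->async_waiting)
		return;

	owner_state = GetPgFdwScanState(owner);
	if (owner_state->last_waiter != NULL)
		GetPgFdwScanState(owner_state->last_waiter)->waiter = node;
	else
		owner_state->waiter = node;
	owner_state->last_waiter = node;
	fsstate->waiter = NULL;
	fsstate->async_waiting = true;
}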

- postgresIterateForeignScan had a rather tricky structure that merged
similar procedures, which made it hard to read. It is now far simpler
and more straightforward; a rough sketch of the new shape follows.
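
This is a sketch under my reading of the patch, not the patch text.
request_more_data() stands for the function that issues the
asynchronous FETCH (its tail is visible at the top of the postgres_fdw
changes quoted earlier), and the buffer bookkeeping follows
fetch_received_data():

/*
 * Sketch of the simplified iterate flow.
 */
static TupleTableSlot *
postgresIterateForeignScan(ForeignScanState *node)
{
	PgFdwScanState *fsstate = GetPgFdwScanState(node);
	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;

	/* Issue a FETCH if nothing is running and nothing is buffered. */
	if (fsstate->next_tuple >= fsstate->num_tuples &&
		!fsstate->eof_reached &&
		fsstate->s.connpriv->current_owner != node)
		request_more_data(node);

	/* If the buffer is exhausted, absorb the result of the FETCH. */
	if (fsstate->next_tuple >= fsstate->num_tuples &&
		fsstate->s.connpriv->current_owner == node)
		fetch_received_data(node);

	/* Still nothing buffered: we have hit the end of the scan. */
	if (fsstate->next_tuple >= fsstate->num_tuples)
		return ExecClearTuple(slot);

	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
				   slot, InvalidBuffer, false);
	return slot;
}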

- This still works only for Append/ForeignScan.

The attached PoC patch theoretically has no impact on the normal
code paths and brings gains only in the async cases.

I performed almost the same tests as before, but with:

- partitioned tables
(there should be no difference from inheritance)

- added tests for fetch_size of 200 and 1000 as well as 100.

A fetch_size of 100 unreasonably magnifies the lag caused by context
switching on a single underpowered box; see tests D/F below. (They
became about twice as fast after adding a small delay (1000 calls of
clock_gettime()(*1)) just before epoll_wait so that it doesn't sleep,
I suppose. A sketch of that spin follows.)
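
To be concrete, the delay is nothing more than a bounded busy loop of
the following kind (illustrative only; the function name is mine, and
the 1000 iterations match the experiment described above):

#include <time.h>

/*
 * Illustrative sketch: burn a short, bounded amount of CPU before
 * calling epoll_wait(), so that a result arriving "just about now"
 * is picked up without paying for a context switch.
 */
static void
tiny_spin_before_wait(void)
{
	struct timespec ts;
	int			i;

	for (i = 0; i < 1000; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);
}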

- The table size in test B is one tenth of the previous size, which
is the same as one partition.

*1: The reason for using that function is that I first found the
queries get much faster merely by prefixing them with "explain analyze".

(Here, gain(%) = 100 * (unpatched - patched) / unpatched.)

patched(ms) unpatched(ms) gain(%)
A: simple table scan : 3562.32 3444.81 -3.4
B: local partitioning : 1451.25 1604.38 9.5
C: single remote table : 8818.92 9297.76 5.1
D: sharding (single con) : 5966.14 6646.73 10.2
E: sharding (multi con) : 1802.25 6515.49 72.3

fetch_size = 100
patched(ms) unpatched(ms) gain(%)
A: simple table scan : 3033.48 2997.44 -1.2
B: local partitioning : 1405.52 1426.66 1.5
C: single remote table : 8335.50 8463.22 1.5
D: sharding (single con) : 6862.92 6820.97 -0.6
E: sharding (multi con) : 2185.84 6733.63 67.5
F: partition (single con): 6818.13 6741.01 -1.1
G: partition (multi con) : 2150.58 6407.46 66.4

fetch_size = 200
patched(ms) unpatched(ms) gain(%)
A: simple table scan :
B: local partitioning :
C: single remote table :
D: sharding (single con) :
E: sharding (multi con) :
F: partition (single con):
G: partition (multi con) :

fetch_size = 1000
patched(ms) unpatched(ms) gain(%)
A: simple table scan : 3050.31 2980.29 -2.3
B: local partitioning : 1401.34 1419.54 1.3
C: single remote table : 8375.4 8445.27 0.8
D: sharding (single con) : 3935.97 4737.84 16.9
E: sharding (multi con) : 1330.44 4752.87 72.0
F: partition (single con): 3997.63 4747.44 15.8
G: partition (multi con) : 1323.02 4807.72 72.5

Async Append doesn't affect the non-async path at all, so B is
expected to show no degradation; the observed difference seems to be
within measurement error.

D and F are the gains when all foreign tables share one connection,
and E and G are the gains when every foreign table has a dedicated
connection.

I will repost next week after filling in the blank portions of the
tables and completing a full regression run of the patch. Sorry for
the incomplete post.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From 7ad4210dd20b6672367255492e2b1d95cd90b122 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

A WaitEventSet needs to be released via a resource owner in certain
cases. This change adds a resource owner field to WaitEventSet and
allows the creator of a WaitEventSet to specify a resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 67 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4f6d4deeb..890972b9b8 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index e6706f7fb8..5457899f2d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index ef1d5baf01..30edc8e83a 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	{
 		WaitEventSet *new_event_set;
 
-		new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+		new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
 		AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  MyLatch, NULL);
 		AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index bce021e100..802b79a660 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -126,6 +126,7 @@ typedef struct ResourceOwnerData
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
 	ResourceArray jitarr;		/* JIT contexts */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -171,6 +172,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -440,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -549,6 +552,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
 			jit_release_context(context);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -697,6 +710,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
 	Assert(owner->jitarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -724,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
 	ResourceArrayFree(&(owner->jitarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1301,3 +1316,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
 		elog(ERROR, "JIT context %p is not owned by resource owner %s",
 			 DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * wait event set reference array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identier of a wait event set,
+	 * use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event set,
+	 * use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a4bcb48874..838845af01 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a6e8eb71ab..3c06e4c3f8 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
 					   Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif							/* RESOWNER_PRIVATE_H */
-- 
2.16.3

0002-core-side-modification.patchtext/x-patch; charset=us-asciiDownload
From 0b3b692e677f7fd19f618582412acf9d12231bb2 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:23:51 +0900
Subject: [PATCH 2/3] core side modification

---
 contrib/postgres_fdw/expected/postgres_fdw.out | 100 +++++-----
 src/backend/commands/explain.c                 |  17 ++
 src/backend/executor/Makefile                  |   2 +-
 src/backend/executor/execAsync.c               | 145 ++++++++++++++
 src/backend/executor/nodeAppend.c              | 262 +++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c         |  22 ++-
 src/backend/nodes/copyfuncs.c                  |   2 +
 src/backend/nodes/outfuncs.c                   |   2 +
 src/backend/nodes/readfuncs.c                  |   2 +
 src/backend/optimizer/plan/createplan.c        |  68 ++++++-
 src/backend/postmaster/pgstat.c                |   3 +
 src/backend/utils/adt/ruleutils.c              |   8 +-
 src/include/executor/execAsync.h               |  23 +++
 src/include/executor/executor.h                |   1 +
 src/include/executor/nodeForeignscan.h         |   3 +
 src/include/foreign/fdwapi.h                   |  11 ++
 src/include/nodes/execnodes.h                  |  18 +-
 src/include/nodes/plannodes.h                  |   7 +
 src/include/pgstat.h                           |   3 +-
 19 files changed, 603 insertions(+), 96 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index bb6b1a8fdf..248aa73c0b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6968,12 +6968,13 @@ select * from bar where f1 in (select f1 from foo) for update;
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(29 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
@@ -7006,12 +7007,13 @@ select * from bar where f1 in (select f1 from foo) for share;
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(29 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
@@ -7043,9 +7045,8 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
    ->  Hash Join
@@ -7061,12 +7062,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7096,14 +7098,11 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo2_1
+               ->  Async Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
          ->  Hash
@@ -7123,17 +7122,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo2_1
+                     ->  Async Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -8155,11 +8155,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)) INNER JOIN (public.ftprt1_p1 t3)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)) INNER JOIN (public.ftprt1_p2 t3)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
   a  |  b  |  c   
@@ -8178,9 +8179,10 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
  Sort
    Sort Key: t1.a, ftprt2_p1.b, ftprt2_p1.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 1 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2)
-(5 rows)
+(6 rows)
 
 SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a < 10 ORDER BY 1,2,3;
  a | b |  c   
@@ -8200,11 +8202,12 @@ SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(7 rows)
+(8 rows)
 
 SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2;
        t1       |       t2       
@@ -8223,11 +8226,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
   a  |  b  
@@ -8309,10 +8313,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
          Group Key: fpagg_tab_p1.a
          Filter: (avg(fpagg_tab_p1.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1
-               ->  Foreign Scan on fpagg_tab_p2
-               ->  Foreign Scan on fpagg_tab_p3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1
+               ->  Async Foreign Scan on fpagg_tab_p2
+               ->  Async Foreign Scan on fpagg_tab_p3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8323,13 +8328,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: fpagg_tab_p1.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73d94b7235..09c5327cb4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -83,6 +83,7 @@ static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
 			  ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1168,6 +1169,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 		if (plan->parallel_aware)
 			appendStringInfoString(es->str, "Parallel ");
+		if (plan->async_capable)
+			appendStringInfoString(es->str, "Async ");
 		appendStringInfoString(es->str, pname);
 		es->indent++;
 	}
@@ -1690,6 +1693,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Hash:
 			show_hash_info(castNode(HashState, planstate), es);
 			break;
+
+		case T_Append:
+			show_append_info(castNode(AppendState, planstate), es);
+			break;
+
 		default:
 			break;
 	}
@@ -2027,6 +2035,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 						 ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+	Append *plan = (Append *) astate->ps.plan;
+
+	if (plan->nasyncplans > 0)
+		ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..8ad2adfe1c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..db477e2cf6
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,145 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void
+ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+	pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+					   void *data, bool reinit)
+{
+	switch (nodeTag(node))
+	{
+		case T_ForeignScanState:
+			return ExecForeignAsyncConfigureWait((ForeignScanState *) node,
+												 wes, data, reinit);
+		default:
+			elog(ERROR, "unrecognized node type: %d",
+				 (int) nodeTag(node));
+	}
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct
+{
+	int **p_refind;
+	int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void
+ExecAsyncMemoryContextCallback(void *arg)
+{
+	/* arg is the address of the variable refind in ExecAsyncEventWait */
+	ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+	*mcbarg->p_refind = NULL;
+	*mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+	static int *refind = NULL;
+	static int refindsize = 0;
+	WaitEventSet *wes;
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int noccurred = 0;
+	Bitmapset *fired_events = NULL;
+	int i;
+	int n;
+
+	n = bms_num_members(waitnodes);
+	wes = CreateWaitEventSet(TopTransactionContext,
+							 TopTransactionResourceOwner, n);
+	if (refindsize < n)
+	{
+		if (refindsize == 0)
+			refindsize = EVENT_BUFFER_SIZE; /* XXX */
+		while (refindsize < n)
+			refindsize *= 2;
+		if (refind)
+			refind = (int *) repalloc(refind, refindsize * sizeof(int));
+		else
+		{
+			static ExecAsync_mcbarg mcb_arg =
+				{ &refind, &refindsize };
+			static MemoryContextCallback mcb =
+				{ ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+			MemoryContext oldctxt =
+				MemoryContextSwitchTo(TopTransactionContext);
+
+			/*
+			 * refind points to a memory block in
+			 * TopTransactionContext. Register a callback to reset it.
+			 */
+			MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+			refind = (int *) palloc(refindsize * sizeof(int));
+			MemoryContextSwitchTo(oldctxt);
+		}
+	}
+
+	n = 0;
+	for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+		 i = bms_next_member(waitnodes, i))
+	{
+		refind[i] = i;
+		if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+			n++;
+	}
+
+	if (n == 0)
+	{
+		FreeWaitEventSet(wes);
+		return NULL;
+	}
+
+	noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+								 EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+	FreeWaitEventSet(wes);
+	if (noccurred == 0)
+		return NULL;
+
+	for (i = 0 ; i < noccurred ; i++)
+	{
+		WaitEvent *w = &occurred_event[i];
+
+		if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		{
+			int n = *(int*)w->user_data;
+
+			fired_events = bms_add_member(fired_events, n);
+		}
+	}
+
+	return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6bc3e470bf..ed8612dd37 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -81,6 +82,7 @@ struct ParallelAppendState
 #define NO_MATCHING_SUBPLANS		-2
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,13 +106,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	PlanState **appendplanstates;
 	Bitmapset  *validsubplans;
 	int			nplans;
+	int			nasyncplans;
 	int			firstvalid;
 	int			i,
 				j;
 	ListCell   *lc;
 
 	/* check for unsupported flags */
-	Assert(!(eflags & EXEC_FLAG_MARK));
+	Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
 	/*
 	 * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -123,10 +126,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	 */
 	appendstate->ps.plan = (Plan *) node;
 	appendstate->ps.state = estate;
-	appendstate->ps.ExecProcNode = ExecAppend;
+
+	/* choose appropriate version of Exec function */
+	if (node->nasyncplans == 0)
+		appendstate->ps.ExecProcNode = ExecAppend;
+	else
+		appendstate->ps.ExecProcNode = ExecAppendAsync;
 
 	/* Let choose_next_subplan_* function handle setting the first subplan */
-	appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+	appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
 	/* If run-time partition pruning is enabled, then set that up now */
 	if (node->part_prune_infos != NIL)
@@ -159,7 +167,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 			 */
 			if (bms_is_empty(validsubplans))
 			{
-				appendstate->as_whichplan = NO_MATCHING_SUBPLANS;
+				appendstate->as_whichsyncplan = NO_MATCHING_SUBPLANS;
 
 				/* Mark the first as valid so that it's initialized below */
 				validsubplans = bms_make_singleton(0);
@@ -213,11 +221,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	 */
 	j = i = 0;
 	firstvalid = nplans;
+	nasyncplans = 0;
 	foreach(lc, node->appendplans)
 	{
 		if (bms_is_member(i, validsubplans))
 		{
 			Plan	   *initNode = (Plan *) lfirst(lc);
+			int			sub_eflags = eflags;
+
+			/* Let async-capable subplans run asynchronously */
+			if (i < node->nasyncplans)
+			{
+				sub_eflags |= EXEC_FLAG_ASYNC;
+				nasyncplans++;
+			}
 
 			/*
 			 * Record the lowest appendplans index which is a valid partial
@@ -226,7 +243,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 			if (i >= node->first_partial_plan && j < firstvalid)
 				firstvalid = j;
 
-			appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+			appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
 		}
 		i++;
 	}
@@ -235,6 +252,21 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
 
+	/* fill in async stuff */
+	appendstate->as_nasyncplans = nasyncplans;
+	appendstate->as_syncdone = (nasyncplans == nplans);
+
+	if (appendstate->as_nasyncplans)
+	{
+		appendstate->as_asyncresult = (TupleTableSlot **)
+			palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+		/* initially, all async requests need a request */
+		for (i = 0; i < appendstate->as_nasyncplans; ++i)
+			appendstate->as_needrequest =
+				bms_add_member(appendstate->as_needrequest, i);
+	}
+
 	/*
 	 * Miscellaneous initialization
 	 */
@@ -258,21 +290,23 @@ ExecAppend(PlanState *pstate)
 {
 	AppendState *node = castNode(AppendState, pstate);
 
-	if (node->as_whichplan < 0)
+	if (node->as_whichsyncplan < 0)
 	{
 		/*
 		 * If no subplan has been chosen, we must choose one before
 		 * proceeding.
 		 */
-		if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+		if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
 			!node->choose_next_subplan(node))
 			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
 		/* Nothing to do if there are no matching subplans */
-		else if (node->as_whichplan == NO_MATCHING_SUBPLANS)
+		else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS)
 			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 	}
 
+	Assert(node->as_nasyncplans == 0);
+
 	for (;;)
 	{
 		PlanState  *subnode;
@@ -283,8 +317,9 @@ ExecAppend(PlanState *pstate)
 		/*
 		 * figure out which subplan we are currently processing
 		 */
-		Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(node->as_whichsyncplan >= 0 &&
+			   node->as_whichsyncplan < node->as_nplans);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -307,6 +342,156 @@ ExecAppend(PlanState *pstate)
 	}
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+	AppendState *node = castNode(AppendState, pstate);
+	Bitmapset *needrequest;
+	int	i;
+
+	Assert(node->as_nasyncplans > 0);
+
+	if (node->as_nasyncresult > 0)
+	{
+		--node->as_nasyncresult;
+		return node->as_asyncresult[node->as_nasyncresult];
+	}
+
+	needrequest = node->as_needrequest;
+	node->as_needrequest = NULL;
+	while ((i = bms_first_member(needrequest)) >= 0)
+	{
+		TupleTableSlot *slot;
+		PlanState *subnode = node->appendplans[i];
+
+		slot = ExecProcNode(subnode);
+		if (subnode->asyncstate == AS_AVAILABLE)
+		{
+			if (!TupIsNull(slot))
+			{
+				node->as_asyncresult[node->as_nasyncresult++] = slot;
+				node->as_needrequest = bms_add_member(node->as_needrequest, i);
+			}
+		}
+		else
+			node->as_pending_async = bms_add_member(node->as_pending_async, i);
+	}
+	bms_free(needrequest);
+
+	for (;;)
+	{
+		TupleTableSlot *result;
+
+		/* return now if a result is available */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		while (!bms_is_empty(node->as_pending_async))
+		{
+			long timeout = node->as_syncdone ? -1 : 0;
+			Bitmapset *fired;
+			int i;
+
+			fired = ExecAsyncEventWait(node->appendplans,
+									   node->as_pending_async,
+									   timeout);
+			Assert(!node->as_syncdone || !bms_is_empty(fired));
+
+			while ((i = bms_first_member(fired)) >= 0)
+			{
+				TupleTableSlot *slot;
+				PlanState *subnode = node->appendplans[i];
+				slot = ExecProcNode(subnode);
+				if (subnode->asyncstate == AS_AVAILABLE)
+				{
+					if (!TupIsNull(slot))
+					{
+						node->as_asyncresult[node->as_nasyncresult++] = slot;
+						node->as_needrequest =
+							bms_add_member(node->as_needrequest, i);
+					}
+					node->as_pending_async =
+						bms_del_member(node->as_pending_async, i);
+				}
+			}
+			bms_free(fired);
+
+			/* return now if a result is available */
+			if (node->as_nasyncresult > 0)
+			{
+				--node->as_nasyncresult;
+				return node->as_asyncresult[node->as_nasyncresult];
+			}
+
+			if (!node->as_syncdone)
+				break;
+		}
+
+		/*
+		 * If there is no asynchronous activity still pending and the
+		 * synchronous activity is also complete, we're totally done scanning
+		 * this node.  Otherwise, we're done with the asynchronous stuff but
+		 * must continue scanning the synchronous children.
+		 */
+		if (node->as_syncdone)
+		{
+			Assert(bms_is_empty(node->as_pending_async));
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		}
+
+		/*
+		 * get a tuple from the subplan
+		 */
+
+		if (node->as_whichsyncplan < 0)
+		{
+			/*
+			 * If no subplan has been chosen, we must choose one before
+			 * proceeding.
+			 */
+			if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
+				!node->choose_next_subplan(node))
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+
+			/* Nothing to do if there are no matching subplans */
+			else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS)
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		}
+
+		result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+		if (!TupIsNull(result))
+		{
+			/*
+			 * If the subplan gave us something then return it as-is. We do
+			 * NOT make use of the result slot that was set up in
+			 * ExecInitAppend; there's no need for it.
+			 */
+			return result;
+		}
+
+		/*
+		 * Go on to the "next" subplan in the appropriate direction. If no
+		 * more subplans, return the empty slot set up for us by
+		 * ExecInitAppend, unless there are async plans we have yet to finish.
+		 */
+		if (!node->choose_next_subplan(node))
+		{
+			node->as_syncdone = true;
+			if (bms_is_empty(node->as_pending_async))
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/* Else loop back and try to get a tuple from the new subplan */
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecEndAppend
  *
@@ -353,6 +538,15 @@ ExecReScanAppend(AppendState *node)
 		node->as_valid_subplans = NULL;
 	}
 
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+	{
+		ExecShutdownNode(node->appendplans[i]);
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	}
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -373,7 +567,7 @@ ExecReScanAppend(AppendState *node)
 	}
 
 	/* Let choose_next_subplan_* function handle setting the first subplan */
-	node->as_whichplan = INVALID_SUBPLAN_INDEX;
+	node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -461,7 +655,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-	int			whichplan = node->as_whichplan;
+	int			whichplan = node->as_whichsyncplan;
 	int			nextplan;
 
 	/* We should never be called when there are no subplans */
@@ -494,7 +688,7 @@ choose_next_subplan_locally(AppendState *node)
 	if (nextplan < 0)
 		return false;
 
-	node->as_whichplan = nextplan;
+	node->as_whichsyncplan = nextplan;
 
 	return true;
 }
@@ -516,19 +710,19 @@ choose_next_subplan_for_leader(AppendState *node)
 	Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
 	/* We should never be called when there are no subplans */
-	Assert(node->as_whichplan != NO_MATCHING_SUBPLANS);
+	Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS);
 
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
 	{
 		/* Mark just-completed subplan as finished. */
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 	}
 	else
 	{
 		/* Start with last subplan. */
-		node->as_whichplan = node->as_nplans - 1;
+		node->as_whichsyncplan = node->as_nplans - 1;
 
 		/*
 		 * If we've yet to determine the valid subplans for these parameters
@@ -549,12 +743,12 @@ choose_next_subplan_for_leader(AppendState *node)
 	}
 
 	/* Loop until we find a subplan to execute. */
-	while (pstate->pa_finished[node->as_whichplan])
+	while (pstate->pa_finished[node->as_whichsyncplan])
 	{
-		if (node->as_whichplan == 0)
+		if (node->as_whichsyncplan == 0)
 		{
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-			node->as_whichplan = INVALID_SUBPLAN_INDEX;
+			node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 			LWLockRelease(&pstate->pa_lock);
 			return false;
 		}
@@ -563,12 +757,12 @@ choose_next_subplan_for_leader(AppendState *node)
 		 * We needn't pay attention to as_valid_subplans here as all invalid
 		 * plans have been marked as finished.
 		 */
-		node->as_whichplan--;
+		node->as_whichsyncplan--;
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < node->as_first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < node->as_first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
@@ -597,13 +791,13 @@ choose_next_subplan_for_worker(AppendState *node)
 	Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
 	/* We should never be called when there are no subplans */
-	Assert(node->as_whichplan != NO_MATCHING_SUBPLANS);
+	Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS);
 
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
 	/* Mark just-completed subplan as finished. */
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	/*
 	 * If we've yet to determine the valid subplans for these parameters then
@@ -625,7 +819,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* Save the plan from which we are starting the search. */
-	node->as_whichplan = pstate->pa_next_plan;
+	node->as_whichsyncplan = pstate->pa_next_plan;
 
 	/* Loop until we find a valid subplan to execute. */
 	while (pstate->pa_finished[pstate->pa_next_plan])
@@ -639,7 +833,7 @@ choose_next_subplan_for_worker(AppendState *node)
 			/* Advance to the next valid plan. */
 			pstate->pa_next_plan = nextplan;
 		}
-		else if (node->as_whichplan > node->as_first_partial_plan)
+		else if (node->as_whichsyncplan > node->as_first_partial_plan)
 		{
 			/*
 			 * Try looping back to the first valid partial plan, if there is
@@ -648,7 +842,7 @@ choose_next_subplan_for_worker(AppendState *node)
 			nextplan = bms_next_member(node->as_valid_subplans,
 									   node->as_first_partial_plan - 1);
 			pstate->pa_next_plan =
-				nextplan < 0 ? node->as_whichplan : nextplan;
+				nextplan < 0 ? node->as_whichsyncplan : nextplan;
 		}
 		else
 		{
@@ -656,10 +850,10 @@ choose_next_subplan_for_worker(AppendState *node)
 			 * At last plan, and either there are no partial plans or we've
 			 * tried them all.  Arrange to bail out.
 			 */
-			pstate->pa_next_plan = node->as_whichplan;
+			pstate->pa_next_plan = node->as_whichsyncplan;
 		}
 
-		if (pstate->pa_next_plan == node->as_whichplan)
+		if (pstate->pa_next_plan == node->as_whichsyncplan)
 		{
 			/* We've tried everything! */
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -669,7 +863,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* Pick the plan we found, and advance pa_next_plan one more time. */
-	node->as_whichplan = pstate->pa_next_plan;
+	node->as_whichsyncplan = pstate->pa_next_plan;
 	pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
 										   pstate->pa_next_plan);
 
@@ -696,8 +890,8 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < node->as_first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < node->as_first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index a2a28b7ec2..915deb7080 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
 					(ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
 	scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+	scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+	if ((eflags & EXEC_FLAG_ASYNC) != 0)
+		scanstate->fs_async = true;
 
 	/*
 	 * Miscellaneous initialization
@@ -387,3 +390,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+							  void *caller_data, bool reinit)
+{
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+												 caller_data, reinit);
+}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c045a7afe..8304dd5b17 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -246,6 +246,8 @@ _copyAppend(const Append *from)
 	COPY_NODE_FIELD(appendplans);
 	COPY_SCALAR_FIELD(first_partial_plan);
 	COPY_NODE_FIELD(part_prune_infos);
+	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1da9d7ed15..ed655f4ccb 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -403,6 +403,8 @@ _outAppend(StringInfo str, const Append *node)
 	WRITE_NODE_FIELD(appendplans);
 	WRITE_INT_FIELD(first_partial_plan);
 	WRITE_NODE_FIELD(part_prune_infos);
+	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2826cec2f8..fb4ae251de 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1652,6 +1652,8 @@ _readAppend(void)
 	READ_NODE_FIELD(appendplans);
 	READ_INT_FIELD(first_partial_plan);
 	READ_NODE_FIELD(part_prune_infos);
+	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0317763f43..eda3420d02 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -211,7 +211,9 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels, List *partpruneinfos);
+						   int nasyncplans,	int referent,
+						   List *tlist,
+						   List *partitioned_rels, List *partpruneinfos);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -294,6 +296,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1036,10 +1039,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	List	   *partpruneinfos = NIL;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1074,7 +1081,22 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/*
+		 * Classify as async-capable or not.  If we have decided to run the
+		 * children in parallel, none of them can run asynchronously.
+		 */
+		if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+		{
+			subplan->async_capable = true;
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	if (enable_partition_pruning &&
@@ -1117,9 +1139,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, best_path->partitioned_rels,
-					   partpruneinfos);
+	plan = make_append(list_concat(asyncplans, syncplans),
+					   best_path->first_partial_path, nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels, partpruneinfos);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5414,9 +5437,9 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels,
-			List *partpruneinfos)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+			int referent, List *tlist,
+			List *partitioned_rels, List *partpruneinfos)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5429,6 +5452,9 @@ make_append(List *appendplans, int first_partial_plan,
 	node->appendplans = appendplans;
 	node->first_partial_plan = first_partial_plan;
 	node->part_prune_infos = partpruneinfos;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
+
 	return node;
 }
 
@@ -6773,3 +6799,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+			break;
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e77c..7aef97ca97 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3683,6 +3683,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 065238b0fe..fe202cbfea 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4513,7 +4513,7 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	dpns->planstate = ps;
 
 	/*
-	 * We special-case Append and MergeAppend to pretend that the first child
+	 * We special-case Append and MergeAppend to pretend that a specific child
 	 * plan is the OUTER referent; we have to interpret OUTER Vars in their
 	 * tlists according to one of the children, and the first one is the most
 	 * natural choice.  Likewise special-case ModifyTable to pretend that the
@@ -4521,7 +4521,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		AppendState *aps = (AppendState *) ps;
+		Append *app = (Append *) ps->plan;
+		dpns->outer_planstate = aps->appendplans[app->referent];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..5fd67d9004
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+								   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+									 long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index a7ea3c7d10..8e9d87669f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS		0x0020	/* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS	0x0040	/* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA	0x0080	/* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC			0x0100	/* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index ccb66be733..67abf8e52e 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+										  WaitEventSet *wes,
+										  void *caller_data, bool reinit);
 
 #endif							/* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index c14eb546c6..c00e9621fb 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -168,6 +168,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
 															List *fdw_private,
 															RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+													WaitEventSet *wes,
+													void *caller_data,
+													bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -189,6 +194,7 @@ typedef struct FdwRoutine
 	GetForeignPlan_function GetForeignPlan;
 	BeginForeignScan_function BeginForeignScan;
 	IterateForeignScan_function IterateForeignScan;
+	IterateForeignScan_function IterateForeignScanAsync;
 	ReScanForeignScan_function ReScanForeignScan;
 	EndForeignScan_function EndForeignScan;
 
@@ -241,6 +247,11 @@ typedef struct FdwRoutine
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 
 	/* Support functions for path reparameterization. */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index da7f52cab0..56bfe3f442 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -905,6 +905,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+	AS_AVAILABLE,
+	AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
 	NodeTag		type;
@@ -953,6 +959,9 @@ typedef struct PlanState
 	 * descriptor, without encoding knowledge about all executor nodes.
 	 */
 	TupleDesc	scandesc;
+
+	AsyncState	asyncstate;
+	int32		padding;			/* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1087,14 +1096,20 @@ struct AppendState
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
-	int			as_whichplan;
+	int			as_whichsyncplan; /* which sync plan is being executed  */
 	int			as_first_partial_plan;	/* Index of 'appendplans' containing
 										 * the first partial plan */
+	int			as_nasyncplans;	/* # of async-capable children */
 	ParallelAppendState *as_pstate; /* parallel coordination info */
 	Size		pstate_len;		/* size of parallel coordination info */
 	struct PartitionPruneState *as_prune_state;
 	Bitmapset  *as_valid_subplans;
 	bool		(*choose_next_subplan) (AppendState *);
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	Bitmapset  *as_pending_async;	/* pending async plans */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1643,6 +1658,7 @@ typedef struct ForeignScanState
 	Size		pscan_len;		/* size of parallel coordination information */
 	/* use struct pointer to avoid including fdwapi.h here */
 	struct FdwRoutine *fdwroutine;
+	bool		fs_async;
 	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
 } ForeignScanState;
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2dda82e66..8a64c037c9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -139,6 +139,11 @@ typedef struct Plan
 	bool		parallel_aware; /* engage parallel-aware logic? */
 	bool		parallel_safe;	/* OK to use as part of parallel plan? */
 
+	/*
+	 * information needed for asynchronous execution
+	 */
+	bool		async_capable;  /* engage asynchronous execution logic? */
+
 	/*
 	 * Common structural data for all Plan types.
 	 */
@@ -262,6 +267,8 @@ typedef struct Append
 	 * Mapping details for run-time subplan pruning, one per partitioned_rels
 	 */
 	List	   *part_prune_infos;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent; 		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239b..6f4583b46c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.16.3
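
To make the consumer side concrete before the postgres_fdw piece: a requestor
is expected to drive its async children through ExecAsyncEventWait() and then
call each ready child, treating an empty slot plus AS_WAITING as "no tuple
yet" rather than EOF.  A minimal sketch, not part of the patch:
fetch_async_result() is a hypothetical helper, and I'm assuming a timeout of
-1 means "wait indefinitely".

#include "postgres.h"

#include "executor/execAsync.h"
#include "executor/executor.h"
#include "nodes/execnodes.h"

/*
 * Hypothetical helper (not in the patch): wait for the pending async
 * children of an Append and return the first tuple that arrives, or
 * NULL once every async child has reached EOF.
 */
static TupleTableSlot *
fetch_async_result(AppendState *node)
{
	while (!bms_is_empty(node->as_pending_async))
	{
		Bitmapset  *ready;
		int			i;

		/* Sleep until at least one pending async child has news for us. */
		ready = ExecAsyncEventWait(node->appendplans,
								   node->as_pending_async, -1);

		i = -1;
		while ((i = bms_next_member(ready, i)) >= 0)
		{
			PlanState  *child = node->appendplans[i];
			TupleTableSlot *slot = ExecProcNode(child);

			/* AS_WAITING means "no tuple yet", not EOF; keep it pending. */
			if (child->asyncstate == AS_WAITING)
				continue;

			node->as_pending_async =
				bms_del_member(node->as_pending_async, i);

			/*
			 * A real Append would also queue a new request for this child
			 * via as_needrequest; an empty slot here means EOF.
			 */
			if (!TupIsNull(slot))
				return slot;
		}
	}
	return NULL;
}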

Attachment: 0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From 072f6af8a2b394402e753a65569d64668e2cfe86 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 ++
 contrib/postgres_fdw/expected/postgres_fdw.out | 100 ++--
 contrib/postgres_fdw/postgres_fdw.c            | 619 ++++++++++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 628 insertions(+), 139 deletions(-)
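
The heart of this patch is the per-connection waiter queue: only one scan
(the connection "leader") may have a query in flight on a given connection,
and every other scan that wants the same connection queues behind it.  Here
is a standalone toy model of that queue discipline, not part of the patch
(names are shortened; the real code below is add_async_waiter,
move_to_next_waiter and friends):

#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef struct Scan Scan;
struct Scan
{
	const char *name;
	Scan	   *waiter;			/* next scan waiting on the same connection */
	Scan	   *last_waiter;	/* queue tail; valid only on the leader */
	int			inqueue;
};

typedef struct
{
	Scan	   *leader;			/* scan whose query is in flight */
} Conn;

static void
add_async_waiter(Conn *conn, Scan *scan)
{
	assert(conn->leader && conn->leader != scan);
	if (scan->inqueue)
		return;
	conn->leader->last_waiter->waiter = scan;
	conn->leader->last_waiter = scan;
	scan->inqueue = 1;
}

static Scan *
move_to_next_waiter(Conn *conn)
{
	Scan	   *leader = conn->leader;
	Scan	   *next = leader->waiter;

	if (next)
	{
		leader->waiter = NULL;
		next->last_waiter = leader->last_waiter;
		next->inqueue = 0;
	}
	conn->leader = next;
	return next;
}

int
main(void)
{
	Scan		a = {"a"}, b = {"b"}, c = {"c"};
	Conn		conn = {NULL};
	Scan	   *s;

	a.last_waiter = &a;			/* a sends a query and becomes leader */
	conn.leader = &a;
	add_async_waiter(&conn, &b);	/* b and c want the same connection */
	add_async_waiter(&conn, &c);

	for (s = conn.leader; s != NULL; s = move_to_next_waiter(&conn))
		printf("%s drains its result and hands over the socket\n", s->name);
	return 0;
}

Running it prints a, b, c in turn, which is the same rotation
fetch_received_data() performs after absorbing each result.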

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index fe4893a8e0..da7c826e4f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
 	bool		invalidated;	/* true if reconnect is pending */
 	uint32		server_hashvalue;	/* hash value of foreign server OID */
 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
+		entry->storage = NULL;
 	}
 
 	/*
@@ -215,6 +217,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	return entry->conn;
 }
 
+/*
+ * Returns the connection-specific storage for this user, allocating it with
+ * initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	bool		found;
+	ConnCacheEntry *entry;
+	ConnCacheKey key;
+
+	key = user->umid;
+	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+	Assert(found);
+
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 248aa73c0b..bb6b1a8fdf 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6968,13 +6968,12 @@ select * from bar where f1 in (select f1 from foo) for update;
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           Async subplans: 1 
-                           ->  Async Foreign Scan on public.foo2
-                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
-                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-(29 rows)
+                           ->  Foreign Scan on public.foo2
+                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
+                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+(23 rows)
 
 select * from bar where f1 in (select f1 from foo) for update;
  f1 | f2 
@@ -7007,13 +7006,12 @@ select * from bar where f1 in (select f1 from foo) for share;
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           Async subplans: 1 
-                           ->  Async Foreign Scan on public.foo2
-                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
-                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-(29 rows)
+                           ->  Foreign Scan on public.foo2
+                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
+                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+(23 rows)
 
 select * from bar where f1 in (select f1 from foo) for share;
  f1 | f2 
@@ -7045,8 +7043,9 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           Async subplans: 1 
-                           ->  Async Foreign Scan on public.foo2
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+                           ->  Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
    ->  Hash Join
@@ -7062,13 +7061,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           Async subplans: 1 
-                           ->  Async Foreign Scan on public.foo2
-                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
-                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
                            ->  Seq Scan on public.foo
                                  Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-(41 rows)
+                           ->  Foreign Scan on public.foo2
+                                 Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
+                                 Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+(39 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7098,11 +7096,14 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               Async subplans: 2 
-               ->  Async Foreign Scan on public.foo2
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Async Foreign Scan on public.foo2 foo2_1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
+               ->  Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
          ->  Hash
@@ -7122,18 +7123,17 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     Async subplans: 2 
-                     ->  Async Foreign Scan on public.foo2
-                           Output: ROW(foo2.f1), foo2.f1
-                           Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Async Foreign Scan on public.foo2 foo2_1
-                           Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
-                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo
                            Output: ROW(foo.f1), foo.f1
+                     ->  Foreign Scan on public.foo2
+                           Output: ROW(foo2.f1), foo2.f1
+                           Remote SQL: SELECT f1 FROM public.loct1
                      ->  Seq Scan on public.foo foo_1
                            Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-(47 rows)
+                     ->  Foreign Scan on public.foo2 foo2_1
+                           Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
+                           Remote SQL: SELECT f1 FROM public.loct1
+(45 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -8155,12 +8155,11 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         Async subplans: 2 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: ((public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)) INNER JOIN (public.ftprt1_p1 t3)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: ((public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)) INNER JOIN (public.ftprt1_p2 t3)
-(8 rows)
+(7 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
   a  |  b  |  c   
@@ -8179,10 +8178,9 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
  Sort
    Sort Key: t1.a, ftprt2_p1.b, ftprt2_p1.c
    ->  Append
-         Async subplans: 1 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2)
-(6 rows)
+(5 rows)
 
 SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a < 10 ORDER BY 1,2,3;
  a | b |  c   
@@ -8202,12 +8200,11 @@ SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Append
-         Async subplans: 2 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(8 rows)
+(7 rows)
 
 SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2;
        t1       |       t2       
@@ -8226,12 +8223,11 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         Async subplans: 2 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(8 rows)
+(7 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
   a  |  b  
@@ -8313,11 +8309,10 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
          Group Key: fpagg_tab_p1.a
          Filter: (avg(fpagg_tab_p1.b) < '22'::numeric)
          ->  Append
-               Async subplans: 3 
-               ->  Async Foreign Scan on fpagg_tab_p1
-               ->  Async Foreign Scan on fpagg_tab_p2
-               ->  Async Foreign Scan on fpagg_tab_p3
-(10 rows)
+               ->  Foreign Scan on fpagg_tab_p1
+               ->  Foreign Scan on fpagg_tab_p2
+               ->  Foreign Scan on fpagg_tab_p3
+(9 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8328,14 +8323,13 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: fpagg_tab_p1.a
    ->  Append
-         Async subplans: 3 
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab)
-         ->  Async Foreign Scan
+         ->  Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab)
-(10 rows)
+(9 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
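
Before the postgres_fdw.c changes below, one note on the ConfigureWait
contract: the requestor owns the WaitEventSet and asks every waiting async
node to add its own event, and a postgres_fdw node declines when it is not
its connection's current leader, since the leader's socket already covers it.
A sketch of the requestor side, not part of the patch (build_wait_set is a
hypothetical helper):

#include "postgres.h"

#include "executor/nodeForeignscan.h"
#include "storage/latch.h"

/*
 * Hypothetical helper (not in the patch): collect the wait events of a
 * set of async foreign scans.  Only connection leaders add their socket.
 */
static WaitEventSet *
build_wait_set(ForeignScanState **scans, void **caller_datas, int nscans)
{
	WaitEventSet *wes = CreateWaitEventSet(CurrentMemoryContext, nscans);
	int			i;

	for (i = 0; i < nscans; i++)
	{
		/* reinit = true: the set is freshly built, so events must be added */
		(void) ExecForeignAsyncConfigureWait(scans[i], wes,
											 caller_datas[i], true);
	}
	return wes;
}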
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 78b0f43ca8..8efbbf95a8 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -119,11 +125,28 @@ enum FdwDirectModifyPrivateIndex
 	FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState   *leader;		/* leader node of this connection */
+	bool				busy;		/* true if this connection is busy */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;	/* true if a tuple or EOF is ready to return */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,12 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		inqueue;		/* true if this node is in waiter queue */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* last waiting node in waiting queue.
+									 * valid only on the leader node */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +192,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +219,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -293,6 +323,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -358,6 +389,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel,
 							 void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+											  WaitEventSet *wes,
+											  void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -378,7 +413,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
 					  RangeTblEntry *rte,
@@ -469,6 +506,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -505,6 +543,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1355,12 +1397,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->leader = NULL;
+	fsstate->s.connpriv->busy = false;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->inqueue = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1408,40 +1460,250 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 							 &fsstate->param_values);
 }
 
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Adds the node to the end of the waiter queue.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+	PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *leader = fsstate->s.connpriv->leader;
+	PgFdwScanState   *leader_state;
+	PgFdwScanState   *last_waiter_state;
+
+	Assert(leader && leader != node);
+
+	/* do nothing if the node is already in the queue */
+	if (fsstate->inqueue)
+		return;
+
+	leader_state = GetPgFdwScanState(leader);
+	last_waiter_state = GetPgFdwScanState(leader_state->last_waiter);
+	last_waiter_state->waiter = node;
+	leader_state->last_waiter = node;
+	fsstate->inqueue = true;
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Makes the first waiter the next leader.
+ * Returns the new leader, or NULL if there is no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *ret = fsstate->waiter;
+
+	Assert(fsstate->s.connpriv->leader == node);
+
+	if (ret)
+	{
+		PgFdwScanState *retstate = GetPgFdwScanState(ret);
+		fsstate->waiter = NULL;
+		retstate->last_waiter = fsstate->last_waiter;
+		retstate->inqueue = false;
+	}
+
+	fsstate->s.connpriv->leader = ret;
+
+	return ret;
+}
+
+/*
+ * remove the node from waiter queue
+ *
+ * This is a bit different from the two above in that it can operate on the
+ * connection leader.  If the node is the active leader, its pending result is
+ * absorbed first so that the connection becomes available again.
+ *
+ * Returns true if the node was found.
+ */
+static inline bool
+remove_async_node(ForeignScanState *node)
+{
+	PgFdwScanState		*fsstate = GetPgFdwScanState(node);
+	ForeignScanState	*leader = fsstate->s.connpriv->leader;
+	PgFdwScanState		*leader_state;
+	ForeignScanState	*prev;
+	PgFdwScanState		*prev_state;
+	ForeignScanState	*cur;
+
+	/* no need to remove me */
+	if (!leader || !fsstate->inqueue)
+		return false;
+
+	leader_state = GetPgFdwScanState(leader);
+
+	/* Remove the leader node */
+	if (leader == node)
+	{
+		ForeignScanState	*next_leader;
+
+		if (leader_state->s.connpriv->busy)
+		{
+			/*
+			 * This node is waiting for a result; absorb it first so that
+			 * the following commands can be sent on the connection.
+			 */
+			PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+			PGconn *conn = leader_state->s.conn;
+
+			while(PQisBusy(conn))
+				PQclear(PQgetResult(conn));
+			
+			leader_state->s.connpriv->busy = false;
+		}
+
+		/* Make the first waiter the leader */
+		if (leader_state->waiter)
+		{
+			PgFdwScanState *next_leader_state;
+
+			next_leader = leader_state->waiter;
+			next_leader_state = GetPgFdwScanState(next_leader);
+
+			leader_state->s.connpriv->leader = next_leader;
+			next_leader_state->last_waiter = leader_state->last_waiter;
+		}
+		leader_state->waiter = NULL;
+
+		return true;
+	}
+
+	/*
+	 * Just remove the node from the queue
+	 *
+	 * This function is called on the shutdown path, so we don't bother
+	 * finding a faster way to do this.
+	 */
+	prev = leader;
+	prev_state = leader_state;
+	cur = GetPgFdwScanState(prev)->waiter;
+	while (cur)
+	{
+		PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+		if (cur == node)
+		{
+			prev_state->waiter = curstate->waiter;
+			if (leader_state->last_waiter == cur)
+				leader_state->last_waiter = prev;
+
+			fsstate->inqueue = false;
+
+			return true;
+		}
+		prev = cur;
+		prev_state = curstate;
+		cur = curstate->waiter;
+	}
+
+	return false;
+}
+
 /*
  * postgresIterateForeignScan
- *		Retrieve next row from the result set, or clear tuple slot to indicate
- *		EOF.
+ *		Retrieve next row from the result set.
+ *
+ *		For synchronous nodes, returns a cleared tuple slot to indicate EOF.
+ *
+ *		If the node is asynchronous, a cleared tuple slot has two meanings:
+ *		when the caller receives a cleared tuple slot, asyncstate indicates
+ *		whether the node has reached EOF (AS_AVAILABLE) or is waiting for
+ *		data to come (AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
+	if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+	{
+		/* we've run out, get some more tuples */
+		if (!node->fs_async)
+		{
+			/* finish running query to send my command */
+			if (!fsstate->s.connpriv->busy)
+				vacate_connection((PgFdwState *)fsstate, false);
+				
+			request_more_data(node);
+
+			/*
+			 * Fetch the result immediately. This executes the next waiter if
+			 * Fetch the result immediately.  This also sends the request
+			 * of the next waiter, if any.
+			fetch_received_data(node);
+		}
+		else if (!fsstate->s.connpriv->busy)
+		{
+			/* If the connection is not busy, just send the request. */
+			request_more_data(node);
+		}
+		else if (fsstate->s.connpriv->leader == node)
+		{
+			bool available = true;
+
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (!(rc & WL_SOCKET_READABLE))
+					available = false;
+			}
+
+			/* The next waiter is executed automatically */
+			if (available)
+				fetch_received_data(node);
+		}
+		else if (fsstate->s.connpriv->leader)
+		{
+			/*
+			 * Someone else is waiting on this connection, so add this
+			 * node to the waiting queue.
+			 */
+			add_async_waiter(node);
+		}
+	}
 
 	/*
-	 * Get some more tuples, if we've run out.
+	 * If we haven't received a result for the given node this time,
+	 * return with no tuple to give way to another node.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
-		if (fsstate->next_tuple >= fsstate->num_tuples)
-			return ExecClearTuple(slot);
+		if (fsstate->eof_reached)
+		{
+			fsstate->result_ready = true;
+			node->ss.ps.asyncstate = AS_AVAILABLE;
+		}
+		else
+		{
+			fsstate->result_ready = false;
+			node->ss.ps.asyncstate = AS_WAITING;
+		}
+			
+		return ExecClearTuple(slot);
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
+	node->ss.ps.asyncstate = AS_AVAILABLE;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1457,7 +1719,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1465,6 +1727,8 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	vacate_connection((PgFdwState *)fsstate, true);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1493,9 +1757,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1513,7 +1777,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1521,15 +1785,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *		Detach the node from the async machinery and clean up the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* remove the node from waiting queue */
+	remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
@@ -1753,6 +2033,9 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	/* finish running query to send my command */
+	vacate_connection((PgFdwState *)fmstate, true);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1763,14 +2046,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1778,10 +2061,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1819,6 +2102,9 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	/* finish running query to send my command */
+	vacate_connection((PgFdwState *)fmstate, true);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1839,14 +2125,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1854,10 +2140,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1895,6 +2181,9 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	/* finish running query to send my command */
+	vacate_connection((PgFdwState *)fmstate, true);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1915,14 +2204,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1930,10 +2219,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2400,7 +2689,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Update the foreign-join-related fields. */
 	if (fsplan->scan.scanrelid == 0)
@@ -2485,7 +2776,11 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		/* finish running query to send my command */
+		vacate_connection((PgFdwState *)dmstate, true);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2532,8 +2827,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* close the target relation. */
 	if (dmstate->resultRel)
@@ -2656,6 +2951,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2698,6 +2994,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+
+			/* finish running query to send my command */
+			vacate_connection(&tmpstate, true);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3061,11 +3369,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -3128,50 +3436,121 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Sends the next request for the node.  If the given node is different from
+ * the current connection leader, makes it the new leader and pushes the old
+ * leader back onto the waiter queue.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *leader = fsstate->s.connpriv->leader;
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* must be non-busy */
+	Assert(!fsstate->s.connpriv->busy);
+	/* must not have reached EOF */
+	Assert(!fsstate->eof_reached);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->busy = true;
+
+	/* Let the node be the leader if it is different from the current one */
+	if (leader != node)
+	{
+		/*
+		 * If a connection leader already exists, make this node the new
+		 * leader and the current leader the first waiter.
+		 */
+		if (leader != NULL)
+		{
+			remove_async_node(node);
+			fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+			fsstate->waiter = leader;
+		}
+		fsstate->s.connpriv->leader = node;
+	}
+}
+
+/*
+ * Fetches the received data and automatically sends the next waiter's request.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
+	ForeignScanState *waiter;
+
+	/* I should be the current connection leader */
+	Assert(fsstate->s.connpriv->leader == node);
 
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3181,26 +3560,76 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->busy = false;
+
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->busy = false;
+
+	/* let the first waiter be the next leader of this connection */
+	waiter = move_to_next_waiter(node);
+
+	/* send the next request if any */
+	if (waiter)
+		request_more_data(waiter);
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate the connection so that this node can send its next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *leader;
+
+	/* the connection is already available */
+	if (connpriv == NULL || connpriv->leader == NULL || !connpriv->busy)
+		return;
+
+	/*
+	 * let the current connection leader read the result for the running query
+	 */
+	leader = connpriv->leader;
+	fetch_received_data(leader);
+
+	/* let the first waiter be the next leader of this connection */
+	move_to_next_waiter(leader);
+
+	if (!clear_queue)
+		return;
+
+	/* Clear the waiting list */
+	while (leader)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+		fsstate->last_waiter = NULL;
+		leader = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3314,7 +3743,9 @@ create_foreign_modify(EState *estate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Set up remote query information. */
@@ -3387,7 +3818,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3397,12 +3828,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3410,9 +3841,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3537,16 +3968,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -3706,9 +4137,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3716,10 +4147,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -5203,6 +5634,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+
+/*
+ * Configure a wait event.
+ *
+ * Add a wait event only when this node is the connection leader;
+ * otherwise another node on this connection is the leader.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+								  void *caller_data, bool reinit)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+
+	/* If the caller didn't reinit, this event is already in event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->leader == node)
+	{
+		AddWaitEventToSet(wes,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, caller_data);
+		return true;
+	}
+
+	return false;
+}
+
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index a5d4011e8d..f344fb7f66 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 231b1e01a5..8ecc903c20 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1617,25 +1617,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1677,12 +1677,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1741,8 +1741,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.16.3

#65Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#64)
3 attachment(s)
Re: [HACKERS] asynchronous execution

This patch set has received further refactoring.

At Fri, 11 May 2018 17:45:20 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20180511.174520.188681124.horiguchi.kyotaro@lab.ntt.co.jp>

But this is not just a rebased version. While fixing the serious
conflicts, I refactored the patch, and I believe it is now far more
readable than its previous shape.

- Waiting queue manipulation has been moved into new functions. It had
a bug where the same node could be inserted into the queue more than
once; that is now fixed (see the sketch after this list).

- postgresIterateForeignScan had a somewhat tricky structure that
merged similar procedures, so it could not be called easy to read at
all. Now it is far simpler and more straightforward.

- Still this works only on Append/ForeignScan.
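
For illustration only, a duplicate-safe enqueue can be sketched with
standalone types like this (this is not the patch's code; the patch
keeps the queue in PgFdwScanState's waiter/last_waiter fields, and the
'queued' membership flag here is a hypothetical stand-in):

#include <stdbool.h>
#include <stddef.h>

typedef struct Waiter
{
	struct Waiter *next;		/* next node waiting on this connection */
	bool		queued;			/* already on the queue? */
} Waiter;

typedef struct ConnQueue
{
	Waiter	   *leader;			/* node whose query is in flight */
	Waiter	   *tail;			/* last waiter, for O(1) append */
} ConnQueue;

/*
 * Append w to the connection's wait queue unless it is already the
 * leader or already queued -- the double-insertion case fixed above.
 */
static void
enqueue_waiter(ConnQueue *q, Waiter *w)
{
	if (q->leader == w || w->queued)
		return;					/* already participating; do nothing */

	if (q->leader == NULL)
	{
		q->leader = w;			/* idle connection: w becomes the leader */
		return;
	}

	w->next = NULL;
	w->queued = true;
	if (q->tail != NULL)
		q->tail->next = w;
	else
		q->leader->next = w;	/* first waiter hangs off the leader */
	q->tail = w;
}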

I performed almost the same tests as before, with some additions:

- partitioned tables (there should be no difference from inheritance,
and the results indeed look that way)

- tests with fetch_size of 200 and 1000 in addition to 100.

A fetch size of 100 seems to unreasonably magnify the lag from context
switching on a single modest box for tests D/F below. They became
about twice as fast by *adding* a small delay (1000 calls to
clock_gettime() (*1)) just before epoll_wait. Things might be
different with the servers on separate machines, but I am not sure
they really are. I have not found the exact cause, nor a way to avoid
it.

*1: I chose that function because I noticed at first that the queries
get much faster merely by prefixing them with "explain analyze".
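
The delay in question was essentially just a busy-loop of
clock_gettime() calls, something like this sketch (the clock id is an
arbitrary choice; the results are simply discarded):

#include <time.h>

static void
tiny_delay(void)
{
	struct timespec ts;
	int			i;

	/* burn a short, roughly fixed amount of CPU before epoll_wait */
	for (i = 0; i < 1000; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);
}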

Async Append (theoretically) no longer affects the non-async path at
all, so B is expected to show no degradation. The difference seems to
be within measurement error.

D and F show the gain when all foreign tables share one connection,
and E and G show the gain when every foreign table has a dedicated
connection.

(previous numbers)

patched(ms) unpatched(ms) gain(%)
A: simple table scan : 3562.32 3444.81 -3.4
B: local partitioning : 1451.25 1604.38 9.5
C: single remote table : 8818.92 9297.76 5.1
D: sharding (single con) : 5966.14 6646.73 10.2
E: sharding (multi con) : 1802.25 6515.49 72.3

fetch_size = 100
patched(ms) unpatched(ms) gain(%)
A: simple table scan : 3065.82 3046.82 -0.62
B: local partitioning : 1393.98 1378.00 -1.16
C: single remote table : 8499.73 8595.66 1.12
D: sharding (single con) : 9267.85 9251.59 -0.18
E: sharding (multi con) : 2567.02 9295.22 72.38
F: partition (single con): 9241.08 9060.19 -2.00
G: partition (multi con) : 2548.86 9419.18 72.94

fetch_size = 200
patched(ms) unpatched(ms) gain(%)
A: simple table scan : 3067.08 2999.23 -2.3
B: local partitioning : 1392.07 1384.49 -0.5
C: single remote table : 8521.72 8505.48 -0.2
D: sharding (single con) : 6752.81 7076.02 4.6
E: sharding (multi con) : 1958.2 7188.02 72.8
F: partition (single con): 6756.72 7000.72 3.5
G: partition (multi con) : 1969.8 7228.85 72.8

fetch_size = 1000
patched(ms) unpatched(ms) gain(%)
A: simple table scan : 4547.44 4519.34 -0.62
B: local partitioning : 2880.66 2739.43 -5.16
C: single remote table : 8448.04 8572.15 1.45
D: sharding (single con) : 2405.01 5919.31 59.37
E: sharding (multi con) : 1872.15 5963.04 68.60
F: partition (single con): 2369.08 5960.81 60.26
G: partition (multi con) : 1854.69 5893.65 68.53
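
(In these tables, gain is (unpatched - patched) / unpatched * 100; for
example, D at fetch_size = 1000 is (5919.31 - 2405.01) / 5919.31 =
59.37%.)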

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Allow-wait-event-set-to-be-registered-to-resource-ow.patchtext/x-patch; charset=us-asciiDownload
From 54f85c159f3feee5ee2dac6daacc7330ec101ed5 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 22 May 2017 12:42:58 +0900
Subject: [PATCH 1/3] Allow wait event set to be registered to resource owner

In certain cases a WaitEventSet needs to be released through a
resource owner. This change adds a resource owner field to
WaitEventSet and allows the creator of a WaitEventSet to specify a
resource owner.
---
 src/backend/libpq/pqcomm.c                    |  2 +-
 src/backend/storage/ipc/latch.c               | 18 ++++++-
 src/backend/storage/lmgr/condition_variable.c |  2 +-
 src/backend/utils/resowner/resowner.c         | 67 +++++++++++++++++++++++++++
 src/include/storage/latch.h                   |  4 +-
 src/include/utils/resowner_private.h          |  8 ++++
 6 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c
index a4f6d4deeb..890972b9b8 100644
--- a/src/backend/libpq/pqcomm.c
+++ b/src/backend/libpq/pqcomm.c
@@ -220,7 +220,7 @@ pq_init(void)
 				(errmsg("could not set socket to nonblocking mode: %m")));
 #endif
 
-	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, 3);
+	FeBeWaitSet = CreateWaitEventSet(TopMemoryContext, NULL, 3);
 	AddWaitEventToSet(FeBeWaitSet, WL_SOCKET_WRITEABLE, MyProcPort->sock,
 					  NULL, NULL);
 	AddWaitEventToSet(FeBeWaitSet, WL_LATCH_SET, -1, MyLatch, NULL);
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index e6706f7fb8..5457899f2d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -51,6 +51,7 @@
 #include "storage/latch.h"
 #include "storage/pmsignal.h"
 #include "storage/shmem.h"
+#include "utils/resowner_private.h"
 
 /*
  * Select the fd readiness primitive to use. Normally the "most modern"
@@ -77,6 +78,8 @@ struct WaitEventSet
 	int			nevents;		/* number of registered events */
 	int			nevents_space;	/* maximum number of events in this set */
 
+	ResourceOwner	resowner;	/* Resource owner */
+
 	/*
 	 * Array, of nevents_space length, storing the definition of events this
 	 * set is waiting for.
@@ -359,7 +362,7 @@ WaitLatchOrSocket(volatile Latch *latch, int wakeEvents, pgsocket sock,
 	int			ret = 0;
 	int			rc;
 	WaitEvent	event;
-	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, 3);
+	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, NULL, 3);
 
 	if (wakeEvents & WL_TIMEOUT)
 		Assert(timeout >= 0);
@@ -517,12 +520,15 @@ ResetLatch(volatile Latch *latch)
  * WaitEventSetWait().
  */
 WaitEventSet *
-CreateWaitEventSet(MemoryContext context, int nevents)
+CreateWaitEventSet(MemoryContext context, ResourceOwner res, int nevents)
 {
 	WaitEventSet *set;
 	char	   *data;
 	Size		sz = 0;
 
+	if (res)
+		ResourceOwnerEnlargeWESs(res);
+
 	/*
 	 * Use MAXALIGN size/alignment to guarantee that later uses of memory are
 	 * aligned correctly. E.g. epoll_event might need 8 byte alignment on some
@@ -591,6 +597,11 @@ CreateWaitEventSet(MemoryContext context, int nevents)
 	StaticAssertStmt(WSA_INVALID_EVENT == NULL, "");
 #endif
 
+	/* Register this wait event set if requested */
+	set->resowner = res;
+	if (res)
+		ResourceOwnerRememberWES(set->resowner, set);
+
 	return set;
 }
 
@@ -632,6 +643,9 @@ FreeWaitEventSet(WaitEventSet *set)
 	}
 #endif
 
+	if (set->resowner != NULL)
+		ResourceOwnerForgetWES(set->resowner, set);
+
 	pfree(set);
 }
 
diff --git a/src/backend/storage/lmgr/condition_variable.c b/src/backend/storage/lmgr/condition_variable.c
index ef1d5baf01..30edc8e83a 100644
--- a/src/backend/storage/lmgr/condition_variable.c
+++ b/src/backend/storage/lmgr/condition_variable.c
@@ -69,7 +69,7 @@ ConditionVariablePrepareToSleep(ConditionVariable *cv)
 	{
 		WaitEventSet *new_event_set;
 
-		new_event_set = CreateWaitEventSet(TopMemoryContext, 2);
+		new_event_set = CreateWaitEventSet(TopMemoryContext, NULL, 2);
 		AddWaitEventToSet(new_event_set, WL_LATCH_SET, PGINVALID_SOCKET,
 						  MyLatch, NULL);
 		AddWaitEventToSet(new_event_set, WL_POSTMASTER_DEATH, PGINVALID_SOCKET,
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index bce021e100..802b79a660 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -126,6 +126,7 @@ typedef struct ResourceOwnerData
 	ResourceArray filearr;		/* open temporary files */
 	ResourceArray dsmarr;		/* dynamic shmem segments */
 	ResourceArray jitarr;		/* JIT contexts */
+	ResourceArray wesarr;		/* wait event sets */
 
 	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
 	int			nlocks;			/* number of owned locks */
@@ -171,6 +172,7 @@ static void PrintTupleDescLeakWarning(TupleDesc tupdesc);
 static void PrintSnapshotLeakWarning(Snapshot snapshot);
 static void PrintFileLeakWarning(File file);
 static void PrintDSMLeakWarning(dsm_segment *seg);
+static void PrintWESLeakWarning(WaitEventSet *events);
 
 
 /*****************************************************************************
@@ -440,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
 	ResourceArrayInit(&(owner->filearr), FileGetDatum(-1));
 	ResourceArrayInit(&(owner->dsmarr), PointerGetDatum(NULL));
 	ResourceArrayInit(&(owner->jitarr), PointerGetDatum(NULL));
+	ResourceArrayInit(&(owner->wesarr), PointerGetDatum(NULL));
 
 	return owner;
 }
@@ -549,6 +552,16 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 
 			jit_release_context(context);
 		}
+
+		/* Ditto for wait event sets */
+		while (ResourceArrayGetAny(&(owner->wesarr), &foundres))
+		{
+			WaitEventSet *event = (WaitEventSet *) DatumGetPointer(foundres);
+
+			if (isCommit)
+				PrintWESLeakWarning(event);
+			FreeWaitEventSet(event);
+		}
 	}
 	else if (phase == RESOURCE_RELEASE_LOCKS)
 	{
@@ -697,6 +710,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	Assert(owner->filearr.nitems == 0);
 	Assert(owner->dsmarr.nitems == 0);
 	Assert(owner->jitarr.nitems == 0);
+	Assert(owner->wesarr.nitems == 0);
 	Assert(owner->nlocks == 0 || owner->nlocks == MAX_RESOWNER_LOCKS + 1);
 
 	/*
@@ -724,6 +738,7 @@ ResourceOwnerDelete(ResourceOwner owner)
 	ResourceArrayFree(&(owner->filearr));
 	ResourceArrayFree(&(owner->dsmarr));
 	ResourceArrayFree(&(owner->jitarr));
+	ResourceArrayFree(&(owner->wesarr));
 
 	pfree(owner);
 }
@@ -1301,3 +1316,55 @@ ResourceOwnerForgetJIT(ResourceOwner owner, Datum handle)
 		elog(ERROR, "JIT context %p is not owned by resource owner %s",
 			 DatumGetPointer(handle), owner->name);
 }
+
+/*
+ * Make sure there is room for one more wait event set reference.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeWESs(ResourceOwner owner)
+{
+	ResourceArrayEnlarge(&(owner->wesarr));
+}
+
+/*
+ * Remember that a wait event set is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeWESs()
+ */
+void
+ResourceOwnerRememberWES(ResourceOwner owner, WaitEventSet *events)
+{
+	ResourceArrayAdd(&(owner->wesarr), PointerGetDatum(events));
+}
+
+/*
+ * Forget that a wait event set is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetWES(ResourceOwner owner, WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set; use its pointer instead.
+	 */
+	if (!ResourceArrayRemove(&(owner->wesarr), PointerGetDatum(events)))
+		elog(ERROR, "wait event set %p is not owned by resource owner %s",
+			 events, owner->name);
+}
+
+/*
+ * Debugging subroutine
+ */
+static void
+PrintWESLeakWarning(WaitEventSet *events)
+{
+	/*
+	 * XXXX: There's no property to show as an identifier of a wait event
+	 * set; use its pointer instead.
+	 */
+	elog(WARNING, "wait event set leak: %p still referenced",
+		 events);
+}
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index a4bcb48874..838845af01 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -101,6 +101,7 @@
 #define LATCH_H
 
 #include <signal.h>
+#include "utils/resowner.h"
 
 /*
  * Latch structure should be treated as opaque and only accessed through
@@ -162,7 +163,8 @@ extern void DisownLatch(volatile Latch *latch);
 extern void SetLatch(volatile Latch *latch);
 extern void ResetLatch(volatile Latch *latch);
 
-extern WaitEventSet *CreateWaitEventSet(MemoryContext context, int nevents);
+extern WaitEventSet *CreateWaitEventSet(MemoryContext context,
+										ResourceOwner res, int nevents);
 extern void FreeWaitEventSet(WaitEventSet *set);
 extern int AddWaitEventToSet(WaitEventSet *set, uint32 events, pgsocket fd,
 				  Latch *latch, void *user_data);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index a6e8eb71ab..3c06e4c3f8 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -18,6 +18,7 @@
 
 #include "storage/dsm.h"
 #include "storage/fd.h"
+#include "storage/latch.h"
 #include "storage/lock.h"
 #include "utils/catcache.h"
 #include "utils/plancache.h"
@@ -95,4 +96,11 @@ extern void ResourceOwnerRememberJIT(ResourceOwner owner,
 extern void ResourceOwnerForgetJIT(ResourceOwner owner,
 					   Datum handle);
 
+/* support for wait event set management */
+extern void ResourceOwnerEnlargeWESs(ResourceOwner owner);
+extern void ResourceOwnerRememberWES(ResourceOwner owner,
+						 WaitEventSet *);
+extern void ResourceOwnerForgetWES(ResourceOwner owner,
+					   WaitEventSet *);
+
 #endif							/* RESOWNER_PRIVATE_H */
-- 
2.16.3
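
For reference, a caller of the extended interface would look roughly
like this sketch (the socket variable 'sock' is assumed; callers that
pass NULL for the resource owner behave exactly as before):

	WaitEvent	event;
	WaitEventSet *wes;

	/* register with the current owner so error cleanup frees the set */
	wes = CreateWaitEventSet(CurrentMemoryContext, CurrentResourceOwner, 2);
	AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
	AddWaitEventToSet(wes, WL_SOCKET_READABLE, sock, NULL, NULL);

	(void) WaitEventSetWait(wes, -1L, &event, 1, WAIT_EVENT_CLIENT_READ);

	FreeWaitEventSet(wes);		/* also forgotten by the resource owner */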

0002-infrastructure-for-asynchronous-execution.patchtext/x-patch; charset=us-asciiDownload
From 19ff6af521070b8245f4bd04bd535a5286be1509 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 15 May 2018 20:21:32 +0900
Subject: [PATCH 2/3] infrastructure for asynchronous execution

This patch adds infrastructure for asynchronous execution. As a PoC,
it makes only Append capable of handling asynchronously executable
subnodes.
---
 src/backend/commands/explain.c          |  17 ++
 src/backend/executor/Makefile           |   2 +-
 src/backend/executor/execAsync.c        | 145 ++++++++++++++++
 src/backend/executor/nodeAppend.c       | 285 ++++++++++++++++++++++++++++----
 src/backend/executor/nodeForeignscan.c  |  22 ++-
 src/backend/nodes/bitmapset.c           |  72 ++++++++
 src/backend/nodes/copyfuncs.c           |   2 +
 src/backend/nodes/outfuncs.c            |   2 +
 src/backend/nodes/readfuncs.c           |   2 +
 src/backend/optimizer/plan/createplan.c |  68 +++++++-
 src/backend/postmaster/pgstat.c         |   3 +
 src/backend/utils/adt/ruleutils.c       |   8 +-
 src/include/executor/execAsync.h        |  23 +++
 src/include/executor/executor.h         |   1 +
 src/include/executor/nodeForeignscan.h  |   3 +
 src/include/foreign/fdwapi.h            |  11 ++
 src/include/nodes/bitmapset.h           |   1 +
 src/include/nodes/execnodes.h           |  18 +-
 src/include/nodes/plannodes.h           |   7 +
 src/include/pgstat.h                    |   3 +-
 20 files changed, 646 insertions(+), 49 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73d94b7235..09c5327cb4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -83,6 +83,7 @@ static void show_sort_keys(SortState *sortstate, List *ancestors,
 			   ExplainState *es);
 static void show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 					   ExplainState *es);
+static void show_append_info(AppendState *astate, ExplainState *es);
 static void show_agg_keys(AggState *astate, List *ancestors,
 			  ExplainState *es);
 static void show_grouping_sets(PlanState *planstate, Agg *agg,
@@ -1168,6 +1169,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		}
 		if (plan->parallel_aware)
 			appendStringInfoString(es->str, "Parallel ");
+		if (plan->async_capable)
+			appendStringInfoString(es->str, "Async ");
 		appendStringInfoString(es->str, pname);
 		es->indent++;
 	}
@@ -1690,6 +1693,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		case T_Hash:
 			show_hash_info(castNode(HashState, planstate), es);
 			break;
+
+		case T_Append:
+			show_append_info(castNode(AppendState, planstate), es);
+			break;
+
 		default:
 			break;
 	}
@@ -2027,6 +2035,15 @@ show_merge_append_keys(MergeAppendState *mstate, List *ancestors,
 						 ancestors, es);
 }
 
+static void
+show_append_info(AppendState *astate, ExplainState *es)
+{
+	Append *plan = (Append *) astate->ps.plan;
+
+	if (plan->nasyncplans > 0)
+		ExplainPropertyInteger("Async subplans", "", plan->nasyncplans, es);
+}
+
 /*
  * Show the grouping keys for an Agg node.
  */
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index cc09895fa5..8ad2adfe1c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,7 +12,7 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execExpr.o execExprInterp.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execExpr.o execExprInterp.o \
        execGrouping.o execIndexing.o execJunk.o \
        execMain.o execParallel.o execPartition.o execProcnode.o \
        execReplication.o execScan.o execSRF.o execTuples.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000000..db477e2cf6
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,145 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/nodeAppend.h"
+#include "executor/nodeForeignscan.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+void ExecAsyncSetState(PlanState *pstate, AsyncState status)
+{
+	pstate->asyncstate = status;
+}
+
+bool
+ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+					   void *data, bool reinit)
+{
+	switch (nodeTag(node))
+	{
+	case T_ForeignScanState:
+		return ExecForeignAsyncConfigureWait((ForeignScanState *)node,
+											 wes, data, reinit);
+		break;
+	default:
+			elog(ERROR, "unrecognized node type: %d",
+				(int) nodeTag(node));
+	}
+}
+
+/*
+ * struct for memory context callback argument used in ExecAsyncEventWait
+ */
+typedef struct {
+	int **p_refind;
+	int *p_refindsize;
+} ExecAsync_mcbarg;
+
+/*
+ * callback function to reset static variables pointing to the memory in
+ * TopTransactionContext in ExecAsyncEventWait.
+ */
+static void ExecAsyncMemoryContextCallback(void *arg)
+{
+	/* arg is the address of the variable refind in ExecAsyncEventWait */
+	ExecAsync_mcbarg *mcbarg = (ExecAsync_mcbarg *) arg;
+	*mcbarg->p_refind = NULL;
+	*mcbarg->p_refindsize = 0;
+}
+
+#define EVENT_BUFFER_SIZE 16
+
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+	static int *refind = NULL;
+	static int refindsize = 0;
+	WaitEventSet *wes;
+	WaitEvent   occurred_event[EVENT_BUFFER_SIZE];
+	int noccurred = 0;
+	Bitmapset *fired_events = NULL;
+	int i;
+	int n;
+
+	n = bms_num_members(waitnodes);
+	wes = CreateWaitEventSet(TopTransactionContext,
+							 TopTransactionResourceOwner, n);
+	if (refindsize < n)
+	{
+		if (refindsize == 0)
+			refindsize = EVENT_BUFFER_SIZE; /* XXX */
+		while (refindsize < n)
+			refindsize *= 2;
+		if (refind)
+			refind = (int *) repalloc(refind, refindsize * sizeof(int));
+		else
+		{
+			static ExecAsync_mcbarg mcb_arg =
+				{ &refind, &refindsize };
+			static MemoryContextCallback mcb =
+				{ ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
+			MemoryContext oldctxt =
+				MemoryContextSwitchTo(TopTransactionContext);
+
+			/*
+			 * refind points to a memory block in
+			 * TopTransactionContext. Register a callback to reset it.
+			 */
+			MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
+			refind = (int *) palloc(refindsize * sizeof(int));
+			MemoryContextSwitchTo(oldctxt);
+		}
+	}
+
+	n = 0;
+	for (i = bms_next_member(waitnodes, -1) ; i >= 0 ;
+		 i = bms_next_member(waitnodes, i))
+	{
+		refind[i] = i;
+		if (ExecAsyncConfigureWait(wes, nodes[i], refind + i, true))
+			n++;
+	}
+
+	if (n == 0)
+	{
+		FreeWaitEventSet(wes);
+		return NULL;
+	}
+
+	noccurred = WaitEventSetWait(wes, timeout, occurred_event,
+								 EVENT_BUFFER_SIZE,
+								 WAIT_EVENT_ASYNC_WAIT);
+	FreeWaitEventSet(wes);
+	if (noccurred == 0)
+		return NULL;
+
+	for (i = 0 ; i < noccurred ; i++)
+	{
+		WaitEvent *w = &occurred_event[i];
+
+		if ((w->events & (WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE)) != 0)
+		{
+			int n = *(int*)w->user_data;
+
+			fired_events = bms_add_member(fired_events, n);
+		}
+	}
+
+	return fired_events;
+}
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6bc3e470bf..94fafe72fb 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -60,6 +60,7 @@
 #include "executor/execdebug.h"
 #include "executor/execPartition.h"
 #include "executor/nodeAppend.h"
+#include "executor/execAsync.h"
 #include "miscadmin.h"
 
 /* Shared state for parallel-aware Append. */
@@ -81,6 +82,7 @@ struct ParallelAppendState
 #define NO_MATCHING_SUBPLANS		-2
 
 static TupleTableSlot *ExecAppend(PlanState *pstate);
+static TupleTableSlot *ExecAppendAsync(PlanState *pstate);
 static bool choose_next_subplan_locally(AppendState *node);
 static bool choose_next_subplan_for_leader(AppendState *node);
 static bool choose_next_subplan_for_worker(AppendState *node);
@@ -104,13 +106,14 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	PlanState **appendplanstates;
 	Bitmapset  *validsubplans;
 	int			nplans;
+	int			nasyncplans;
 	int			firstvalid;
 	int			i,
 				j;
 	ListCell   *lc;
 
 	/* check for unsupported flags */
-	Assert(!(eflags & EXEC_FLAG_MARK));
+	Assert(!(eflags & (EXEC_FLAG_MARK | EXEC_FLAG_ASYNC)));
 
 	/*
 	 * Lock the non-leaf tables in the partition tree controlled by this node.
@@ -123,10 +126,15 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	 */
 	appendstate->ps.plan = (Plan *) node;
 	appendstate->ps.state = estate;
-	appendstate->ps.ExecProcNode = ExecAppend;
+
+	/* choose appropriate version of Exec function */
+	if (node->nasyncplans == 0)
+		appendstate->ps.ExecProcNode = ExecAppend;
+	else
+		appendstate->ps.ExecProcNode = ExecAppendAsync;
 
 	/* Let choose_next_subplan_* function handle setting the first subplan */
-	appendstate->as_whichplan = INVALID_SUBPLAN_INDEX;
+	appendstate->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 
 	/* If run-time partition pruning is enabled, then set that up now */
 	if (node->part_prune_infos != NIL)
@@ -159,7 +167,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 			 */
 			if (bms_is_empty(validsubplans))
 			{
-				appendstate->as_whichplan = NO_MATCHING_SUBPLANS;
+				appendstate->as_whichsyncplan = NO_MATCHING_SUBPLANS;
 
 				/* Mark the first as valid so that it's initialized below */
 				validsubplans = bms_make_singleton(0);
@@ -213,11 +221,20 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	 */
 	j = i = 0;
 	firstvalid = nplans;
+	nasyncplans = 0;
 	foreach(lc, node->appendplans)
 	{
 		if (bms_is_member(i, validsubplans))
 		{
 			Plan	   *initNode = (Plan *) lfirst(lc);
+			int			sub_eflags = eflags;
+
+			/* Let async-capable subplans run asynchronously */
+			if (i < node->nasyncplans)
+			{
+				sub_eflags |= EXEC_FLAG_ASYNC;
+				nasyncplans++;
+			}
 
 			/*
 			 * Record the lowest appendplans index which is a valid partial
@@ -226,7 +243,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 			if (i >= node->first_partial_plan && j < firstvalid)
 				firstvalid = j;
 
-			appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
+			appendplanstates[j++] = ExecInitNode(initNode, estate, sub_eflags);
 		}
 		i++;
 	}
@@ -235,6 +252,21 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
 
+	/* fill in async stuff */
+	appendstate->as_nasyncplans = nasyncplans;
+	appendstate->as_syncdone = (nasyncplans == nplans);
+
+	if (appendstate->as_nasyncplans)
+	{
+		appendstate->as_asyncresult = (TupleTableSlot **)
+			palloc0(node->nasyncplans * sizeof(TupleTableSlot *));
+
+		/* initially, all async requests need a request */
+		for (i = 0; i < appendstate->as_nasyncplans; ++i)
+			appendstate->as_needrequest =
+				bms_add_member(appendstate->as_needrequest, i);
+	}
+
 	/*
 	 * Miscellaneous initialization
 	 */
@@ -258,21 +290,23 @@ ExecAppend(PlanState *pstate)
 {
 	AppendState *node = castNode(AppendState, pstate);
 
-	if (node->as_whichplan < 0)
+	if (node->as_whichsyncplan < 0)
 	{
 		/*
 		 * If no subplan has been chosen, we must choose one before
 		 * proceeding.
 		 */
-		if (node->as_whichplan == INVALID_SUBPLAN_INDEX &&
+		if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
 			!node->choose_next_subplan(node))
 			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 
 		/* Nothing to do if there are no matching subplans */
-		else if (node->as_whichplan == NO_MATCHING_SUBPLANS)
+		else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS)
 			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
 	}
 
+	Assert(node->as_nasyncplans == 0);
+
 	for (;;)
 	{
 		PlanState  *subnode;
@@ -283,8 +317,9 @@ ExecAppend(PlanState *pstate)
 		/*
 		 * figure out which subplan we are currently processing
 		 */
-		Assert(node->as_whichplan >= 0 && node->as_whichplan < node->as_nplans);
-		subnode = node->appendplans[node->as_whichplan];
+		Assert(node->as_whichsyncplan >= 0 &&
+			   node->as_whichsyncplan < node->as_nplans);
+		subnode = node->appendplans[node->as_whichsyncplan];
 
 		/*
 		 * get a tuple from the subplan
@@ -307,6 +342,175 @@ ExecAppend(PlanState *pstate)
 	}
 }
 
+static TupleTableSlot *
+ExecAppendAsync(PlanState *pstate)
+{
+	AppendState *node = castNode(AppendState, pstate);
+	Bitmapset *needrequest;
+	int	i;
+
+	Assert(node->as_nasyncplans > 0);
+
+restart:
+	if (node->as_nasyncresult > 0)
+	{
+		--node->as_nasyncresult;
+		return node->as_asyncresult[node->as_nasyncresult];
+	}
+
+	needrequest = node->as_needrequest;
+	node->as_needrequest = NULL;
+	while ((i = bms_first_member(needrequest)) >= 0)
+	{
+		TupleTableSlot *slot;
+		PlanState *subnode = node->appendplans[i];
+
+		slot = ExecProcNode(subnode);
+		if (subnode->asyncstate == AS_AVAILABLE)
+		{
+			if (!TupIsNull(slot))
+			{
+				node->as_asyncresult[node->as_nasyncresult++] = slot;
+				node->as_needrequest = bms_add_member(node->as_needrequest, i);
+			}
+		}
+		else
+			node->as_pending_async = bms_add_member(node->as_pending_async, i);
+	}
+	bms_free(needrequest);
+
+	for (;;)
+	{
+		TupleTableSlot *result;
+
+		/* return now if a result is available */
+		if (node->as_nasyncresult > 0)
+		{
+			--node->as_nasyncresult;
+			return node->as_asyncresult[node->as_nasyncresult];
+		}
+
+		while (!bms_is_empty(node->as_pending_async))
+		{
+			long timeout = node->as_syncdone ? -1 : 0;
+			Bitmapset *fired;
+			int i;
+
+			fired = ExecAsyncEventWait(node->appendplans,
+									   node->as_pending_async,
+									   timeout);
+
+			if (bms_is_empty(fired) && node->as_syncdone)
+			{
+				/*
+				 * No subplan fired. This can happen even in normal operation,
+				 * when a subnode already prepared results before we waited.
+				 * as_pending_async then holds stale information, so restart
+				 * from the beginning.
+				 */
+				node->as_needrequest = node->as_pending_async;
+				node->as_pending_async = NULL;
+				goto restart;
+			}
+
+			while ((i = bms_first_member(fired)) >= 0)
+			{
+				TupleTableSlot *slot;
+				PlanState *subnode = node->appendplans[i];
+				slot = ExecProcNode(subnode);
+				if (subnode->asyncstate == AS_AVAILABLE)
+				{
+					if (!TupIsNull(slot))
+					{
+						node->as_asyncresult[node->as_nasyncresult++] = slot;
+						node->as_needrequest =
+							bms_add_member(node->as_needrequest, i);
+					}
+					node->as_pending_async =
+						bms_del_member(node->as_pending_async, i);
+				}
+			}
+			bms_free(fired);
+
+			/* return now if a result is available */
+			if (node->as_nasyncresult > 0)
+			{
+				--node->as_nasyncresult;
+				return node->as_asyncresult[node->as_nasyncresult];
+			}
+
+			if (!node->as_syncdone)
+				break;
+		}
+
+		/*
+		 * If there is no asynchronous activity still pending and the
+		 * synchronous activity is also complete, we're totally done scanning
+		 * this node.  Otherwise, we're done with the asynchronous stuff but
+		 * must continue scanning the synchronous children.
+		 */
+		if (node->as_syncdone)
+		{
+			Assert(bms_is_empty(node->as_pending_async));
+			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		}
+
+		/*
+		 * get a tuple from the subplan
+		 */
+
+		if (node->as_whichsyncplan < 0)
+		{
+			/*
+			 * If no subplan has been chosen, we must choose one before
+			 * proceeding.
+			 */
+			if (node->as_whichsyncplan == INVALID_SUBPLAN_INDEX &&
+				!node->choose_next_subplan(node))
+			{
+				node->as_syncdone = true;
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+
+			/* Nothing to do if there are no matching subplans */
+			else if (node->as_whichsyncplan == NO_MATCHING_SUBPLANS)
+			{
+				node->as_syncdone = true;
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		result = ExecProcNode(node->appendplans[node->as_whichsyncplan]);
+
+		if (!TupIsNull(result))
+		{
+			/*
+			 * If the subplan gave us something then return it as-is. We do
+			 * NOT make use of the result slot that was set up in
+			 * ExecInitAppend; there's no need for it.
+			 */
+			return result;
+		}
+
+		/*
+		 * Go on to the "next" subplan. If no more subplans, return the empty
+		 * slot set up for us by ExecInitAppend, unless there are async plans
+		 * we have yet to finish.
+		 */
+		if (!node->choose_next_subplan(node))
+		{
+			node->as_syncdone = true;
+			if (bms_is_empty(node->as_pending_async))
+			{
+				Assert(bms_is_empty(node->as_needrequest));
+				return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+			}
+		}
+
+		/* Else loop back and try to get a tuple from the new subplan */
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecEndAppend
  *
@@ -353,6 +557,15 @@ ExecReScanAppend(AppendState *node)
 		node->as_valid_subplans = NULL;
 	}
 
+	/* Reset async state. */
+	for (i = 0; i < node->as_nasyncplans; ++i)
+	{
+		ExecShutdownNode(node->appendplans[i]);
+		node->as_needrequest = bms_add_member(node->as_needrequest, i);
+	}
+	node->as_nasyncresult = 0;
+	node->as_syncdone = (node->as_nasyncplans == node->as_nplans);
+
 	for (i = 0; i < node->as_nplans; i++)
 	{
 		PlanState  *subnode = node->appendplans[i];
@@ -373,7 +586,7 @@ ExecReScanAppend(AppendState *node)
 	}
 
 	/* Let choose_next_subplan_* function handle setting the first subplan */
-	node->as_whichplan = INVALID_SUBPLAN_INDEX;
+	node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 }
 
 /* ----------------------------------------------------------------
@@ -461,7 +674,7 @@ ExecAppendInitializeWorker(AppendState *node, ParallelWorkerContext *pwcxt)
 static bool
 choose_next_subplan_locally(AppendState *node)
 {
-	int			whichplan = node->as_whichplan;
+	int			whichplan = node->as_whichsyncplan;
 	int			nextplan;
 
 	/* We should never be called when there are no subplans */
@@ -480,6 +693,10 @@ choose_next_subplan_locally(AppendState *node)
 			node->as_valid_subplans =
 				ExecFindMatchingSubPlans(node->as_prune_state);
 
+		/* Exclude async plans */
+		if (node->as_nasyncplans > 0)
+			node->as_valid_subplans = bms_del_range(node->as_valid_subplans, 0, node->as_nasyncplans - 1);
+
 		whichplan = -1;
 	}
 
@@ -494,7 +711,7 @@ choose_next_subplan_locally(AppendState *node)
 	if (nextplan < 0)
 		return false;
 
-	node->as_whichplan = nextplan;
+	node->as_whichsyncplan = nextplan;
 
 	return true;
 }
@@ -516,19 +733,19 @@ choose_next_subplan_for_leader(AppendState *node)
 	Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
 	/* We should never be called when there are no subplans */
-	Assert(node->as_whichplan != NO_MATCHING_SUBPLANS);
+	Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS);
 
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
 	{
 		/* Mark just-completed subplan as finished. */
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 	}
 	else
 	{
 		/* Start with last subplan. */
-		node->as_whichplan = node->as_nplans - 1;
+		node->as_whichsyncplan = node->as_nplans - 1;
 
 		/*
 		 * If we've yet to determine the valid subplans for these parameters
@@ -549,12 +766,12 @@ choose_next_subplan_for_leader(AppendState *node)
 	}
 
 	/* Loop until we find a subplan to execute. */
-	while (pstate->pa_finished[node->as_whichplan])
+	while (pstate->pa_finished[node->as_whichsyncplan])
 	{
-		if (node->as_whichplan == 0)
+		if (node->as_whichsyncplan == 0)
 		{
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
-			node->as_whichplan = INVALID_SUBPLAN_INDEX;
+			node->as_whichsyncplan = INVALID_SUBPLAN_INDEX;
 			LWLockRelease(&pstate->pa_lock);
 			return false;
 		}
@@ -563,12 +780,12 @@ choose_next_subplan_for_leader(AppendState *node)
 		 * We needn't pay attention to as_valid_subplans here as all invalid
 		 * plans have been marked as finished.
 		 */
-		node->as_whichplan--;
+		node->as_whichsyncplan--;
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < node->as_first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < node->as_first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
@@ -597,13 +814,13 @@ choose_next_subplan_for_worker(AppendState *node)
 	Assert(ScanDirectionIsForward(node->ps.state->es_direction));
 
 	/* We should never be called when there are no subplans */
-	Assert(node->as_whichplan != NO_MATCHING_SUBPLANS);
+	Assert(node->as_whichsyncplan != NO_MATCHING_SUBPLANS);
 
 	LWLockAcquire(&pstate->pa_lock, LW_EXCLUSIVE);
 
 	/* Mark just-completed subplan as finished. */
-	if (node->as_whichplan != INVALID_SUBPLAN_INDEX)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan != INVALID_SUBPLAN_INDEX)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	/*
 	 * If we've yet to determine the valid subplans for these parameters then
@@ -625,7 +842,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* Save the plan from which we are starting the search. */
-	node->as_whichplan = pstate->pa_next_plan;
+	node->as_whichsyncplan = pstate->pa_next_plan;
 
 	/* Loop until we find a valid subplan to execute. */
 	while (pstate->pa_finished[pstate->pa_next_plan])
@@ -639,7 +856,7 @@ choose_next_subplan_for_worker(AppendState *node)
 			/* Advance to the next valid plan. */
 			pstate->pa_next_plan = nextplan;
 		}
-		else if (node->as_whichplan > node->as_first_partial_plan)
+		else if (node->as_whichsyncplan > node->as_first_partial_plan)
 		{
 			/*
 			 * Try looping back to the first valid partial plan, if there is
@@ -648,7 +865,7 @@ choose_next_subplan_for_worker(AppendState *node)
 			nextplan = bms_next_member(node->as_valid_subplans,
 									   node->as_first_partial_plan - 1);
 			pstate->pa_next_plan =
-				nextplan < 0 ? node->as_whichplan : nextplan;
+				nextplan < 0 ? node->as_whichsyncplan : nextplan;
 		}
 		else
 		{
@@ -656,10 +873,10 @@ choose_next_subplan_for_worker(AppendState *node)
 			 * At last plan, and either there are no partial plans or we've
 			 * tried them all.  Arrange to bail out.
 			 */
-			pstate->pa_next_plan = node->as_whichplan;
+			pstate->pa_next_plan = node->as_whichsyncplan;
 		}
 
-		if (pstate->pa_next_plan == node->as_whichplan)
+		if (pstate->pa_next_plan == node->as_whichsyncplan)
 		{
 			/* We've tried everything! */
 			pstate->pa_next_plan = INVALID_SUBPLAN_INDEX;
@@ -669,7 +886,7 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* Pick the plan we found, and advance pa_next_plan one more time. */
-	node->as_whichplan = pstate->pa_next_plan;
+	node->as_whichsyncplan = pstate->pa_next_plan;
 	pstate->pa_next_plan = bms_next_member(node->as_valid_subplans,
 										   pstate->pa_next_plan);
 
@@ -696,8 +913,8 @@ choose_next_subplan_for_worker(AppendState *node)
 	}
 
 	/* If non-partial, immediately mark as finished. */
-	if (node->as_whichplan < node->as_first_partial_plan)
-		node->as_pstate->pa_finished[node->as_whichplan] = true;
+	if (node->as_whichsyncplan < node->as_first_partial_plan)
+		node->as_pstate->pa_finished[node->as_whichsyncplan] = true;
 
 	LWLockRelease(&pstate->pa_lock);
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index a2a28b7ec2..915deb7080 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -123,7 +123,6 @@ ExecForeignScan(PlanState *pstate)
 					(ExecScanRecheckMtd) ForeignRecheck);
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecInitForeignScan
  * ----------------------------------------------------------------
@@ -147,6 +146,10 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
 	scanstate->ss.ps.ExecProcNode = ExecForeignScan;
+	scanstate->ss.ps.asyncstate = AS_AVAILABLE;
+
+	if ((eflags & EXEC_FLAG_ASYNC) != 0)
+		scanstate->fs_async = true;
 
 	/*
 	 * Miscellaneous initialization
@@ -387,3 +390,20 @@ ExecShutdownForeignScan(ForeignScanState *node)
 	if (fdwroutine->ShutdownForeignScan)
 		fdwroutine->ShutdownForeignScan(node);
 }
+
+/* ----------------------------------------------------------------
+ *		ExecAsyncForeignScanConfigureWait
+ *
+ *		In async mode, configure for a wait
+ * ----------------------------------------------------------------
+ */
+bool
+ExecForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+							  void *caller_data, bool reinit)
+{
+	FdwRoutine *fdwroutine = node->fdwroutine;
+
+	Assert(fdwroutine->ForeignAsyncConfigureWait != NULL);
+	return fdwroutine->ForeignAsyncConfigureWait(node, wes,
+												 caller_data, reinit);
+}
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 9bf9a29d6b..b2ab879d49 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -922,6 +922,78 @@ bms_add_range(Bitmapset *a, int lower, int upper)
 	return a;
 }
 
+/*
+ * bms_del_range
+ *		Delete members in the range of 'lower' to 'upper' from the set.
+ *
+ * Note this could also be done by calling bms_del_member in a loop, however,
+ * using this function will be faster when the range is large as we work at
+ * the bitmapword level rather than at bit level.
+ */
+Bitmapset *
+bms_del_range(Bitmapset *a, int lower, int upper)
+{
+	int			lwordnum,
+				lbitnum,
+				uwordnum,
+				ushiftbits,
+				wordnum;
+
+	if (lower < 0 || upper < 0)
+		elog(ERROR, "negative bitmapset member not allowed");
+	if (lower > upper)
+		elog(ERROR, "lower range must not be above upper range");
+	uwordnum = WORDNUM(upper);
+
+	if (a == NULL)
+	{
+		a = (Bitmapset *) palloc0(BITMAPSET_SIZE(uwordnum + 1));
+		a->nwords = uwordnum + 1;
+	}
+
+	/* ensure we have enough words to store the upper bit */
+	else if (uwordnum >= a->nwords)
+	{
+		int			oldnwords = a->nwords;
+		int			i;
+
+		a = (Bitmapset *) repalloc(a, BITMAPSET_SIZE(uwordnum + 1));
+		a->nwords = uwordnum + 1;
+		/* zero out the enlarged portion */
+		for (i = oldnwords; i < a->nwords; i++)
+			a->words[i] = 0;
+	}
+
+	wordnum = lwordnum = WORDNUM(lower);
+
+	lbitnum = BITNUM(lower);
+	ushiftbits = BITNUM(upper) + 1;
+
+	/*
+	 * Special case: when lwordnum is the same as uwordnum, we must perform
+	 * both the upper and lower masking on the same word.
+	 */
+	if (lwordnum == uwordnum)
+	{
+		a->words[lwordnum] &= ((bitmapword) (((bitmapword) 1 << lbitnum) - 1)
+								| (~(bitmapword) 0) << ushiftbits);
+	}
+	else
+	{
+		/* turn off lbitnum and all bits left of it */
+		a->words[wordnum++] &= (bitmapword) (((bitmapword) 1 << lbitnum) - 1);
+
+		/* turn off all bits for any intermediate words */
+		while (wordnum < uwordnum)
+			a->words[wordnum++] = (bitmapword) 0;
+
+		/* turn off upper's bit and all bits right of it. */
+		a->words[uwordnum] &= (~(bitmapword) 0) << ushiftbits;
+	}
+
+	return a;
+}
+
 /*
  * bms_int_members - like bms_intersect, but left input is recycled
  */
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 7c045a7afe..8304dd5b17 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -246,6 +246,8 @@ _copyAppend(const Append *from)
 	COPY_NODE_FIELD(appendplans);
 	COPY_SCALAR_FIELD(first_partial_plan);
 	COPY_NODE_FIELD(part_prune_infos);
+	COPY_SCALAR_FIELD(nasyncplans);
+	COPY_SCALAR_FIELD(referent);
 
 	return newnode;
 }
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 1da9d7ed15..ed655f4ccb 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -403,6 +403,8 @@ _outAppend(StringInfo str, const Append *node)
 	WRITE_NODE_FIELD(appendplans);
 	WRITE_INT_FIELD(first_partial_plan);
 	WRITE_NODE_FIELD(part_prune_infos);
+	WRITE_INT_FIELD(nasyncplans);
+	WRITE_INT_FIELD(referent);
 }
 
 static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 2826cec2f8..fb4ae251de 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -1652,6 +1652,8 @@ _readAppend(void)
 	READ_NODE_FIELD(appendplans);
 	READ_INT_FIELD(first_partial_plan);
 	READ_NODE_FIELD(part_prune_infos);
+	READ_INT_FIELD(nasyncplans);
+	READ_INT_FIELD(referent);
 
 	READ_DONE();
 }
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0317763f43..eda3420d02 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -211,7 +211,9 @@ static NamedTuplestoreScan *make_namedtuplestorescan(List *qptlist, List *qpqual
 static WorkTableScan *make_worktablescan(List *qptlist, List *qpqual,
 				   Index scanrelid, int wtParam);
 static Append *make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels, List *partpruneinfos);
+						   int nasyncplans,	int referent,
+						   List *tlist,
+						   List *partitioned_rels, List *partpruneinfos);
 static RecursiveUnion *make_recursive_union(List *tlist,
 					 Plan *lefttree,
 					 Plan *righttree,
@@ -294,6 +296,7 @@ static ModifyTable *make_modifytable(PlannerInfo *root,
 				 List *rowMarks, OnConflictExpr *onconflict, int epqParam);
 static GatherMerge *create_gather_merge_plan(PlannerInfo *root,
 						 GatherMergePath *best_path);
+static bool is_async_capable_path(Path *path);
 
 
 /*
@@ -1036,10 +1039,14 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 {
 	Append	   *plan;
 	List	   *tlist = build_path_tlist(root, &best_path->path);
-	List	   *subplans = NIL;
+	List	   *asyncplans = NIL;
+	List	   *syncplans = NIL;
 	ListCell   *subpaths;
 	RelOptInfo *rel = best_path->path.parent;
 	List	   *partpruneinfos = NIL;
+	int			nasyncplans = 0;
+	bool		first = true;
+	bool		referent_is_sync = true;
 
 	/*
 	 * The subpaths list could be empty, if every child was proven empty by
@@ -1074,7 +1081,22 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 		/* Must insist that all children return the same tlist */
 		subplan = create_plan_recurse(root, subpath, CP_EXACT_TLIST);
 
-		subplans = lappend(subplans, subplan);
+		/*
+		 * Classify as async-capable or not. If we have decided to run the
+		 * children in parallel, we cannot let any of them run asynchronously.
+		 */
+		if (!best_path->path.parallel_safe && is_async_capable_path(subpath))
+		{
+			subplan->async_capable = true;
+			asyncplans = lappend(asyncplans, subplan);
+			++nasyncplans;
+			if (first)
+				referent_is_sync = false;
+		}
+		else
+			syncplans = lappend(syncplans, subplan);
+
+		first = false;
 	}
 
 	if (enable_partition_pruning &&
@@ -1117,9 +1139,10 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path)
 	 * parent-rel Vars it'll be asked to emit.
 	 */
 
-	plan = make_append(subplans, best_path->first_partial_path,
-					   tlist, best_path->partitioned_rels,
-					   partpruneinfos);
+	plan = make_append(list_concat(asyncplans, syncplans),
+					   best_path->first_partial_path, nasyncplans,
+					   referent_is_sync ? nasyncplans : 0, tlist,
+					   best_path->partitioned_rels, partpruneinfos);
 
 	copy_generic_path_info(&plan->plan, (Path *) best_path);
 
@@ -5414,9 +5437,9 @@ make_foreignscan(List *qptlist,
 }
 
 static Append *
-make_append(List *appendplans, int first_partial_plan,
-			List *tlist, List *partitioned_rels,
-			List *partpruneinfos)
+make_append(List *appendplans, int first_partial_plan, int nasyncplans,
+			int referent, List *tlist,
+			List *partitioned_rels, List *partpruneinfos)
 {
 	Append	   *node = makeNode(Append);
 	Plan	   *plan = &node->plan;
@@ -5429,6 +5452,9 @@ make_append(List *appendplans, int first_partial_plan,
 	node->appendplans = appendplans;
 	node->first_partial_plan = first_partial_plan;
 	node->part_prune_infos = partpruneinfos;
+	node->nasyncplans = nasyncplans;
+	node->referent = referent;
+
 	return node;
 }
 
@@ -6773,3 +6799,27 @@ is_projection_capable_plan(Plan *plan)
 	}
 	return true;
 }
+
+/*
+ * is_async_capable_path
+ *		Check whether a given Path node is async-capable.
+ */
+static bool
+is_async_capable_path(Path *path)
+{
+	switch (nodeTag(path))
+	{
+		case T_ForeignPath:
+			{
+				FdwRoutine *fdwroutine = path->parent->fdwroutine;
+
+				Assert(fdwroutine != NULL);
+				if (fdwroutine->IsForeignPathAsyncCapable != NULL &&
+					fdwroutine->IsForeignPathAsyncCapable((ForeignPath *) path))
+					return true;
+			}
+		default:
+			break;
+	}
+	return false;
+}
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e77c..7aef97ca97 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3683,6 +3683,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
 		case WAIT_EVENT_SYNC_REP:
 			event_name = "SyncRep";
 			break;
+		case WAIT_EVENT_ASYNC_WAIT:
+			event_name = "AsyncExecWait";
+			break;
 			/* no default case, so that compiler will warn */
 	}
 
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 065238b0fe..fe202cbfea 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -4513,7 +4513,7 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	dpns->planstate = ps;
 
 	/*
-	 * We special-case Append and MergeAppend to pretend that the first child
+	 * We special-case Append and MergeAppend to pretend that a specific child
 	 * plan is the OUTER referent; we have to interpret OUTER Vars in their
 	 * tlists according to one of the children, and the first one is the most
 	 * natural choice.  Likewise special-case ModifyTable to pretend that the
@@ -4521,7 +4521,11 @@ set_deparse_planstate(deparse_namespace *dpns, PlanState *ps)
 	 * lists containing references to non-target relations.
 	 */
 	if (IsA(ps, AppendState))
-		dpns->outer_planstate = ((AppendState *) ps)->appendplans[0];
+	{
+		AppendState *aps = (AppendState *) ps;
+		Append *app = (Append *) ps->plan;
+		dpns->outer_planstate = aps->appendplans[app->referent];
+	}
 	else if (IsA(ps, MergeAppendState))
 		dpns->outer_planstate = ((MergeAppendState *) ps)->mergeplans[0];
 	else if (IsA(ps, ModifyTableState))
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000000..5fd67d9004
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+#include "storage/latch.h"
+
+extern void ExecAsyncSetState(PlanState *pstate, AsyncState status);
+extern bool ExecAsyncConfigureWait(WaitEventSet *wes, PlanState *node,
+								   void *data, bool reinit);
+extern Bitmapset *ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes,
+									 long timeout);
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index a7ea3c7d10..8e9d87669f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -63,6 +63,7 @@
 #define EXEC_FLAG_WITH_OIDS		0x0020	/* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS	0x0040	/* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA	0x0080	/* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC			0x0100	/* request async execution */
 
 
 /* Hook for plugins to get control in ExecutorStart() */
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index ccb66be733..67abf8e52e 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -30,5 +30,8 @@ extern void ExecForeignScanReInitializeDSM(ForeignScanState *node,
 extern void ExecForeignScanInitializeWorker(ForeignScanState *node,
 								ParallelWorkerContext *pwcxt);
 extern void ExecShutdownForeignScan(ForeignScanState *node);
+extern bool ExecForeignAsyncConfigureWait(ForeignScanState *node,
+										  WaitEventSet *wes,
+										  void *caller_data, bool reinit);
 
 #endif							/* NODEFOREIGNSCAN_H */
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index c14eb546c6..c00e9621fb 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -168,6 +168,11 @@ typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 typedef List *(*ReparameterizeForeignPathByChild_function) (PlannerInfo *root,
 															List *fdw_private,
 															RelOptInfo *child_rel);
+typedef bool (*IsForeignPathAsyncCapable_function) (ForeignPath *path);
+typedef bool (*ForeignAsyncConfigureWait_function) (ForeignScanState *node,
+													WaitEventSet *wes,
+													void *caller_data,
+													bool reinit);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -189,6 +194,7 @@ typedef struct FdwRoutine
 	GetForeignPlan_function GetForeignPlan;
 	BeginForeignScan_function BeginForeignScan;
 	IterateForeignScan_function IterateForeignScan;
+	IterateForeignScan_function IterateForeignScanAsync;
 	ReScanForeignScan_function ReScanForeignScan;
 	EndForeignScan_function EndForeignScan;
 
@@ -241,6 +247,11 @@ typedef struct FdwRoutine
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	ReInitializeDSMForeignScan_function ReInitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	IsForeignPathAsyncCapable_function IsForeignPathAsyncCapable;
+	ForeignAsyncConfigureWait_function ForeignAsyncConfigureWait;
+
 	ShutdownForeignScan_function ShutdownForeignScan;
 
 	/* Support functions for path reparameterization. */
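
From an FDW author's point of view, opting in means filling the two new
callbacks from the handler.  A minimal sketch under the proposed API (the
myfdw_* names are hypothetical, and a real IsForeignPathAsyncCapable
should inspect the path rather than blindly return true):

    static bool
    myfdwIsForeignPathAsyncCapable(ForeignPath *path)
    {
        return true;
    }

    static bool
    myfdwForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
                                   void *caller_data, bool reinit)
    {
        /* typically: add a WL_SOCKET_READABLE event for the remote socket */
        return true;
    }

    /* ... and wherever the FdwRoutine gets built: */
    routine->IsForeignPathAsyncCapable = myfdwIsForeignPathAsyncCapable;
    routine->ForeignAsyncConfigureWait = myfdwForeignAsyncConfigureWait;

The boolean result of ForeignAsyncConfigureWait lets a scan decline to add
an event, which matters once several scans multiplex one connection, as in
the postgres_fdw patch below.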
diff --git a/src/include/nodes/bitmapset.h b/src/include/nodes/bitmapset.h
index b6f1a9e6e5..41f0927934 100644
--- a/src/include/nodes/bitmapset.h
+++ b/src/include/nodes/bitmapset.h
@@ -94,6 +94,7 @@ extern Bitmapset *bms_add_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_add_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_int_members(Bitmapset *a, const Bitmapset *b);
 extern Bitmapset *bms_del_members(Bitmapset *a, const Bitmapset *b);
+extern Bitmapset *bms_del_range(Bitmapset *a, int lower, int upper);
 extern Bitmapset *bms_join(Bitmapset *a, Bitmapset *b);
 
 /* support for iterating through the integer elements of a set: */
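
bms_del_range() is simply the counterpart of the existing bms_add_range();
the Append changes can clear a contiguous block of subplan indexes in one
call instead of looping over bms_del_member().  For example (hypothetical
fragment, assuming the async children occupy indexes 0 .. nasyncplans-1):

    valid_subplans = bms_del_range(valid_subplans, 0, nasyncplans - 1);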
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index da7f52cab0..56bfe3f442 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -905,6 +905,12 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (struct PlanState *pstate);
  * abstract superclass for all PlanState-type nodes.
  * ----------------
  */
+typedef enum AsyncState
+{
+	AS_AVAILABLE,
+	AS_WAITING
+} AsyncState;
+
 typedef struct PlanState
 {
 	NodeTag		type;
@@ -953,6 +959,9 @@ typedef struct PlanState
 	 * descriptor, without encoding knowledge about all executor nodes.
 	 */
 	TupleDesc	scandesc;
+
+	AsyncState	asyncstate;
+	int32		padding;			/* to keep alignment of derived types */
 } PlanState;
 
 /* ----------------
@@ -1087,14 +1096,20 @@ struct AppendState
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
 	int			as_nplans;
-	int			as_whichplan;
+	int			as_whichsyncplan; /* which sync plan is being executed  */
 	int			as_first_partial_plan;	/* Index of 'appendplans' containing
 										 * the first partial plan */
+	int			as_nasyncplans;	/* # of async-capable children */
 	ParallelAppendState *as_pstate; /* parallel coordination info */
 	Size		pstate_len;		/* size of parallel coordination info */
 	struct PartitionPruneState *as_prune_state;
 	Bitmapset  *as_valid_subplans;
 	bool		(*choose_next_subplan) (AppendState *);
+	bool		as_syncdone;	/* all synchronous plans done? */
+	Bitmapset  *as_needrequest;	/* async plans needing a new request */
+	Bitmapset  *as_pending_async;	/* pending async plans */
+	TupleTableSlot **as_asyncresult;	/* unreturned results of async plans */
+	int			as_nasyncresult;	/* # of valid entries in as_asyncresult */
 };
 
 /* ----------------
@@ -1643,6 +1658,7 @@ typedef struct ForeignScanState
 	Size		pscan_len;		/* size of parallel coordination information */
 	/* use struct pointer to avoid including fdwapi.h here */
 	struct FdwRoutine *fdwroutine;
+	bool		fs_async;
 	void	   *fdw_state;		/* foreign-data wrapper can keep state here */
 } ForeignScanState;
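
The new AppendState fields split the async bookkeeping three ways:
as_needrequest holds children that need a fresh request, as_pending_async
holds those with a request in flight, and as_asyncresult buffers tuples
that arrived while the Append was busy elsewhere.  The buffered side is
meant to drain roughly like this (hypothetical helper, not patch code):

    static TupleTableSlot *
    append_pop_async_result(AppendState *node)
    {
        /* return a tuple buffered by a completed async child, if any */
        if (node->as_nasyncresult > 0)
            return node->as_asyncresult[--node->as_nasyncresult];
        return NULL;    /* nothing buffered; caller must request/wait */
    }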
 
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index f2dda82e66..8a64c037c9 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -139,6 +139,11 @@ typedef struct Plan
 	bool		parallel_aware; /* engage parallel-aware logic? */
 	bool		parallel_safe;	/* OK to use as part of parallel plan? */
 
+	/*
+	 * information needed for asynchronous execution
+	 */
+	bool		async_capable;  /* engage asynchronous execution logic? */
+
 	/*
 	 * Common structural data for all Plan types.
 	 */
@@ -262,6 +267,8 @@ typedef struct Append
 	 * Mapping details for run-time subplan pruning, one per partitioned_rels
 	 */
 	List	   *part_prune_infos;
+	int			nasyncplans;	/* # of async plans, always at start of list */
+	int			referent;		/* index of inheritance tree referent */
 } Append;
 
 /* ----------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239b..6f4583b46c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -832,7 +832,8 @@ typedef enum
 	WAIT_EVENT_REPLICATION_ORIGIN_DROP,
 	WAIT_EVENT_REPLICATION_SLOT_DROP,
 	WAIT_EVENT_SAFE_SNAPSHOT,
-	WAIT_EVENT_SYNC_REP
+	WAIT_EVENT_SYNC_REP,
+	WAIT_EVENT_ASYNC_WAIT
 } WaitEventIPC;
 
 /* ----------
-- 
2.16.3

0003-async-postgres_fdw.patch (text/x-patch; charset=us-ascii)
From 72120b5c2b0775d33186dec7d4fc206e63094c20 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 19 Oct 2017 17:24:07 +0900
Subject: [PATCH 3/3] async postgres_fdw

---
 contrib/postgres_fdw/connection.c              |  26 +
 contrib/postgres_fdw/expected/postgres_fdw.out | 198 ++++----
 contrib/postgres_fdw/postgres_fdw.c            | 633 ++++++++++++++++++++++---
 contrib/postgres_fdw/postgres_fdw.h            |   2 +
 contrib/postgres_fdw/sql/postgres_fdw.sql      |  20 +-
 5 files changed, 708 insertions(+), 171 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index fe4893a8e0..da7c826e4f 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -58,6 +58,7 @@ typedef struct ConnCacheEntry
 	bool		invalidated;	/* true if reconnect is pending */
 	uint32		server_hashvalue;	/* hash value of foreign server OID */
 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -202,6 +203,7 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 
 		elog(DEBUG3, "new postgres_fdw connection %p for server \"%s\" (user mapping oid %u, userid %u)",
 			 entry->conn, server->servername, user->umid, user->userid);
+		entry->storage = NULL;
 	}
 
 	/*
@@ -215,6 +217,30 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 	return entry->conn;
 }
 
+/*
+ * Returns the connection-specific storage for this user.  Allocates it,
+ * zero-filled with initsize bytes, if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	bool		found;
+	ConnCacheEntry *entry;
+	ConnCacheKey key;
+
+	key = user->umid;
+	entry = hash_search(ConnectionHash, &key, HASH_ENTER, &found);
+	Assert(found);
+
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
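+
+/*
+ * Usage sketch (this mirrors what postgresBeginForeignScan does below;
+ * the variable name is illustrative only):
+ *
+ *		priv = (PgFdwConnpriv *)
+ *			GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+ *
+ * Every scan sharing the user's cached connection receives the same
+ * chunk, so it can hold the per-connection leader/waiter bookkeeping.
+ */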
+
 /*
  * Connect to remote server using specified server and user mapping properties.
  */
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index bb6b1a8fdf..cddc207c04 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -6793,7 +6793,7 @@ INSERT INTO a(aa) VALUES('aaaaa');
 INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |  aa   
 ----------+-------
  a        | aaa
@@ -6821,7 +6821,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6849,7 +6849,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE b SET aa = 'new';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | aaa
@@ -6877,7 +6877,7 @@ SELECT tableoid::regclass, * FROM ONLY a;
 (3 rows)
 
 UPDATE a SET aa = 'newtoo';
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
  tableoid |   aa   
 ----------+--------
  a        | newtoo
@@ -6947,35 +6947,41 @@ insert into bar2 values(3,33,33);
 insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+                                                   QUERY PLAN                                                    
+-----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR UPDATE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR UPDATE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
  f1 | f2 
 ----+----
   1 | 11
@@ -6985,35 +6991,41 @@ select * from bar where f1 in (select f1 from foo) for update;
 (4 rows)
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-                                          QUERY PLAN                                          
-----------------------------------------------------------------------------------------------
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+                                                   QUERY PLAN                                                   
+----------------------------------------------------------------------------------------------------------------
  LockRows
    Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
-   ->  Hash Join
+   ->  Merge Join
          Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
-         Hash Cond: (bar.f1 = foo.f1)
-         ->  Append
-               ->  Seq Scan on public.bar
+         Merge Cond: (bar.f1 = foo.f1)
+         ->  Merge Append
+               Sort Key: bar.f1
+               ->  Sort
                      Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
+                     Sort Key: bar.f1
+                     ->  Seq Scan on public.bar
+                           Output: bar.f1, bar.f2, bar.ctid, bar.*, bar.tableoid
                ->  Foreign Scan on public.bar2
                      Output: bar2.f1, bar2.f2, bar2.ctid, bar2.*, bar2.tableoid
-                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 FOR SHARE
-         ->  Hash
+                     Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct2 ORDER BY f1 ASC NULLS LAST FOR SHARE
+         ->  Sort
                Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+               Sort Key: foo.f1
                ->  HashAggregate
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(23 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(29 rows)
 
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
  f1 | f2 
 ----+----
   1 | 11
@@ -7043,11 +7055,12 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
    ->  Hash Join
          Output: bar2.f1, (bar2.f2 + 100), bar2.f3, bar2.ctid, foo.ctid, foo.*, foo.tableoid
          Inner Unique: true
@@ -7061,12 +7074,13 @@ update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
                      Output: foo.ctid, foo.*, foo.tableoid, foo.f1
                      Group Key: foo.f1
                      ->  Append
-                           ->  Seq Scan on public.foo
-                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
-                           ->  Foreign Scan on public.foo2
+                           Async subplans: 1 
+                           ->  Async Foreign Scan on public.foo2
                                  Output: foo2.ctid, foo2.*, foo2.tableoid, foo2.f1
                                  Remote SQL: SELECT f1, f2, f3, ctid FROM public.loct1
-(39 rows)
+                           ->  Seq Scan on public.foo
+                                 Output: foo.ctid, foo.*, foo.tableoid, foo.f1
+(41 rows)
 
 update bar set f2 = f2 + 100 where f1 in (select f1 from foo);
 select tableoid::regclass, * from bar order by 1,2;
@@ -7096,16 +7110,17 @@ where bar.f1 = ss.f1;
          Output: bar.f1, (bar.f2 + 100), bar.ctid, (ROW(foo.f1))
          Hash Cond: (foo.f1 = bar.f1)
          ->  Append
-               ->  Seq Scan on public.foo
-                     Output: ROW(foo.f1), foo.f1
-               ->  Foreign Scan on public.foo2
+               Async subplans: 2 
+               ->  Async Foreign Scan on public.foo2
                      Output: ROW(foo2.f1), foo2.f1
                      Remote SQL: SELECT f1 FROM public.loct1
-               ->  Seq Scan on public.foo foo_1
-                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-               ->  Foreign Scan on public.foo2 foo2_1
+               ->  Async Foreign Scan on public.foo2 foo2_1
                      Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                      Remote SQL: SELECT f1 FROM public.loct1
+               ->  Seq Scan on public.foo
+                     Output: ROW(foo.f1), foo.f1
+               ->  Seq Scan on public.foo foo_1
+                     Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
          ->  Hash
                Output: bar.f1, bar.f2, bar.ctid
                ->  Seq Scan on public.bar
@@ -7123,17 +7138,18 @@ where bar.f1 = ss.f1;
                Output: (ROW(foo.f1)), foo.f1
                Sort Key: foo.f1
                ->  Append
-                     ->  Seq Scan on public.foo
-                           Output: ROW(foo.f1), foo.f1
-                     ->  Foreign Scan on public.foo2
+                     Async subplans: 2 
+                     ->  Async Foreign Scan on public.foo2
                            Output: ROW(foo2.f1), foo2.f1
                            Remote SQL: SELECT f1 FROM public.loct1
-                     ->  Seq Scan on public.foo foo_1
-                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
-                     ->  Foreign Scan on public.foo2 foo2_1
+                     ->  Async Foreign Scan on public.foo2 foo2_1
                            Output: ROW((foo2_1.f1 + 3)), (foo2_1.f1 + 3)
                            Remote SQL: SELECT f1 FROM public.loct1
-(45 rows)
+                     ->  Seq Scan on public.foo
+                           Output: ROW(foo.f1), foo.f1
+                     ->  Seq Scan on public.foo foo_1
+                           Output: ROW((foo_1.f1 + 3)), (foo_1.f1 + 3)
+(47 rows)
 
 update bar set f2 = f2 + 100
 from
@@ -7283,27 +7299,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
@@ -8155,11 +8177,12 @@ SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER J
  Sort
    Sort Key: t1.a, t3.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: ((public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)) INNER JOIN (public.ftprt1_p1 t3)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: ((public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)) INNER JOIN (public.ftprt1_p2 t3)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t2.b,t3.c FROM fprt1 t1 INNER JOIN fprt2 t2 ON (t1.a = t2.b) INNER JOIN fprt1 t3 ON (t2.b = t3.a) WHERE t1.a % 25 =0 ORDER BY 1,2,3;
   a  |  b  |  c   
@@ -8178,9 +8201,10 @@ SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10)
  Sort
    Sort Key: t1.a, ftprt2_p1.b, ftprt2_p1.c
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 1 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) LEFT JOIN (public.ftprt2_p1 fprt2)
-(5 rows)
+(6 rows)
 
 SELECT t1.a,t2.b,t2.c FROM fprt1 t1 LEFT JOIN (SELECT * FROM fprt2 WHERE a < 10) t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a < 10 ORDER BY 1,2,3;
  a | b |  c   
@@ -8200,11 +8224,12 @@ SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE
  Sort
    Sort Key: ((t1.*)::fprt1), ((t2.*)::fprt2)
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(7 rows)
+(8 rows)
 
 SELECT t1,t2 FROM fprt1 t1 JOIN fprt2 t2 ON (t1.a = t2.b and t1.b = t2.a) WHERE t1.a % 25 =0 ORDER BY 1,2;
        t1       |       t2       
@@ -8223,11 +8248,12 @@ SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t
  Sort
    Sort Key: t1.a, t1.b
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 2 
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p1 t1) INNER JOIN (public.ftprt2_p1 t2)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: (public.ftprt1_p2 t1) INNER JOIN (public.ftprt2_p2 t2)
-(7 rows)
+(8 rows)
 
 SELECT t1.a,t1.b FROM fprt1 t1, LATERAL (SELECT t2.a, t2.b FROM fprt2 t2 WHERE t1.a = t2.b AND t1.b = t2.a) q WHERE t1.a%25 = 0 ORDER BY 1,2;
   a  |  b  
@@ -8309,10 +8335,11 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
          Group Key: fpagg_tab_p1.a
          Filter: (avg(fpagg_tab_p1.b) < '22'::numeric)
          ->  Append
-               ->  Foreign Scan on fpagg_tab_p1
-               ->  Foreign Scan on fpagg_tab_p2
-               ->  Foreign Scan on fpagg_tab_p3
-(9 rows)
+               Async subplans: 3 
+               ->  Async Foreign Scan on fpagg_tab_p1
+               ->  Async Foreign Scan on fpagg_tab_p2
+               ->  Async Foreign Scan on fpagg_tab_p3
+(10 rows)
 
 -- Plan with partitionwise aggregates is enabled
 SET enable_partitionwise_aggregate TO true;
@@ -8323,13 +8350,14 @@ SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 O
  Sort
    Sort Key: fpagg_tab_p1.a
    ->  Append
-         ->  Foreign Scan
+         Async subplans: 3 
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p1 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p2 pagg_tab)
-         ->  Foreign Scan
+         ->  Async Foreign Scan
                Relations: Aggregate on (public.fpagg_tab_p3 pagg_tab)
-(9 rows)
+(10 rows)
 
 SELECT a, sum(b), min(b), count(*) FROM pagg_tab GROUP BY a HAVING avg(b) < 22 ORDER BY 1;
  a  | sum  | min | count 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 78b0f43ca8..51d19cc421 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -20,6 +20,8 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
+#include "executor/nodeForeignscan.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -34,6 +36,7 @@
 #include "optimizer/var.h"
 #include "optimizer/tlist.h"
 #include "parser/parsetree.h"
+#include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
 #include "utils/lsyscache.h"
@@ -53,6 +56,9 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
+
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -119,11 +125,28 @@ enum FdwDirectModifyPrivateIndex
 	FdwDirectModifyPrivateSetProcessed
 };
 
+/*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnpriv
+{
+	ForeignScanState   *leader;		/* leader node of this connection */
+	bool				busy;		/* true if this connection is busy */
+} PgFdwConnpriv;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnpriv *connpriv;	/* connection private memory */
+} PgFdwState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -134,7 +157,7 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
+	bool		result_ready;	/* true if a result is ready to return */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -150,6 +173,12 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		run_async;		/* true if run asynchronously */
+	bool		inqueue;		/* true if this node is in waiter queue */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* last waiting node in waiting queue.
+									 * valid only on the leader node */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -163,11 +192,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -190,6 +219,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -293,6 +323,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -358,6 +389,10 @@ static void postgresGetForeignUpperPaths(PlannerInfo *root,
 							 RelOptInfo *input_rel,
 							 RelOptInfo *output_rel,
 							 void *extra);
+static bool postgresIsForeignPathAsyncCapable(ForeignPath *path);
+static bool postgresForeignAsyncConfigureWait(ForeignScanState *node,
+											  WaitEventSet *wes,
+											  void *caller_data, bool reinit);
 
 /*
  * Helper functions
@@ -378,7 +413,9 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn, bool clear_queue);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static PgFdwModifyState *create_foreign_modify(EState *estate,
 					  RangeTblEntry *rte,
@@ -469,6 +506,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -505,6 +543,10 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for upper relation push-down */
 	routine->GetForeignUpperPaths = postgresGetForeignUpperPaths;
 
+	/* Support functions for async execution */
+	routine->IsForeignPathAsyncCapable = postgresIsForeignPathAsyncCapable;
+	routine->ForeignAsyncConfigureWait = postgresForeignAsyncConfigureWait;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1355,12 +1397,22 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
+	fsstate->s.connpriv->leader = NULL;
+	fsstate->s.connpriv->busy = false;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->run_async = false;
+	fsstate->inqueue = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1408,40 +1460,258 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 							 &fsstate->param_values);
 }
 
+/*
+ * Async queue manipulation functions
+ */
+
+/*
+ * add_async_waiter:
+ *
+ * Adds the node to the end of the waiter queue.  Starts the node
+ * immediately if no node is currently running on the connection.
+ */
+static inline void
+add_async_waiter(ForeignScanState *node)
+{
+	PgFdwScanState   *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *leader = fsstate->s.connpriv->leader;
+
+	/* do nothing if the node is already in the queue or already eof'ed */
+	if (leader == node || fsstate->inqueue || fsstate->eof_reached)
+		return;
+
+	if (leader == NULL)
+	{
+		/* immediately send request if not busy */
+		request_more_data(node);
+	}
+	else
+	{
+		PgFdwScanState   *leader_state = GetPgFdwScanState(leader);
+		PgFdwScanState   *last_waiter_state
+			= GetPgFdwScanState(leader_state->last_waiter);
+
+		last_waiter_state->waiter = node;
+		leader_state->last_waiter = node;
+		fsstate->inqueue = true;
+	}
+}
+
+/*
+ * move_to_next_waiter:
+ *
+ * Makes the first waiter the next leader.
+ * Returns the new leader or NULL if there's no waiter.
+ */
+static inline ForeignScanState *
+move_to_next_waiter(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *ret = fsstate->waiter;
+
+	Assert(fsstate->s.connpriv->leader == node);
+
+	if (ret)
+	{
+		PgFdwScanState *retstate = GetPgFdwScanState(ret);
+		fsstate->waiter = NULL;
+		retstate->last_waiter = fsstate->last_waiter;
+		retstate->inqueue = false;
+	}
+
+	fsstate->s.connpriv->leader = ret;
+
+	return ret;
+}
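+
+/*
+ * For illustration: with scans A, B and C multiplexed onto one connection,
+ * A as leader and waiter chain A -> B -> C (last_waiter = C), absorbing
+ * A's result and calling move_to_next_waiter(A) promotes B to leader,
+ * leaving the chain B -> C with last_waiter still C.
+ */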
+
+/*
+ * Removes the node from the waiter queue.
+ *
+ * This is a bit different from the two functions above in that it can
+ * operate on the connection leader.  If called on the active leader, the
+ * pending result is absorbed first so the connection becomes available.
+ *
+ * Returns true if the node was found.
+ */
+static inline bool
+remove_async_node(ForeignScanState *node)
+{
+	PgFdwScanState		*fsstate = GetPgFdwScanState(node);
+	ForeignScanState	*leader = fsstate->s.connpriv->leader;
+	PgFdwScanState		*leader_state;
+	ForeignScanState	*prev;
+	PgFdwScanState		*prev_state;
+	ForeignScanState	*cur;
+
+	/* no need to remove me */
+	if (!leader || !fsstate->inqueue)
+		return false;
+
+	leader_state = GetPgFdwScanState(leader);
+
+	/* Remove the leader node */
+	if (leader == node)
+	{
+		ForeignScanState	*next_leader;
+
+		if (leader_state->s.connpriv->busy)
+		{
+			/*
+			 * This node is waiting for a result; absorb it first so that
+			 * subsequent commands can be sent on the connection.
+			 */
+			PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+			PGconn *conn = leader_state->s.conn;
+
+			while (PQisBusy(conn))
+				PQclear(PQgetResult(conn));
+
+			leader_state->s.connpriv->busy = false;
+		}
+
+		/* Make the first waiter the leader */
+		if (leader_state->waiter)
+		{
+			PgFdwScanState *next_leader_state;
+
+			next_leader = leader_state->waiter;
+			next_leader_state = GetPgFdwScanState(next_leader);
+
+			leader_state->s.connpriv->leader = next_leader;
+			next_leader_state->last_waiter = leader_state->last_waiter;
+		}
+		leader_state->waiter = NULL;
+
+		return true;
+	}
+
+	/*
+	 * Just remove the node from the queue.
+	 *
+	 * This function is called on the shutdown path, so we don't bother
+	 * finding a faster way to do this.
+	 */
+	prev = leader;
+	prev_state = leader_state;
+	cur = GetPgFdwScanState(prev)->waiter;
+	while (cur)
+	{
+		PgFdwScanState *curstate = GetPgFdwScanState(cur);
+
+		if (cur == node)
+		{
+			prev_state->waiter = curstate->waiter;
+
+			/* if the removed node was the last waiter, back up to prev */
+			if (leader_state->last_waiter == cur)
+				leader_state->last_waiter = prev;
+
+			fsstate->inqueue = false;
+
+			return true;
+		}
+		prev = cur;
+		prev_state = curstate;
+		cur = curstate->waiter;
+	}
+
+	return false;
+}
+
 /*
  * postgresIterateForeignScan
- *		Retrieve next row from the result set, or clear tuple slot to indicate
- *		EOF.
+ *		Retrieve next row from the result set.
+ *
+ *		For synchronous nodes, returns a cleared tuple slot to indicate EOF.
+ *
+ *		For asynchronous nodes, a cleared tuple slot is ambiguous: when the
+ *		caller receives one, asyncstate indicates whether the node has
+ *		reached EOF (AS_AVAILABLE) or is still waiting for data to arrive
+ *		(AS_WAITING).
  */
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
-	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
+	if (fsstate->next_tuple >= fsstate->num_tuples && !fsstate->eof_reached)
+	{
+		/* we've run out, get some more tuples */
+		if (!node->fs_async)
+		{
+			/* finish running query to send my command */
+			if (fsstate->s.connpriv->busy)
+				vacate_connection((PgFdwState *) fsstate, false);
+
+			request_more_data(node);
+
+			/*
+			 * Fetch the result immediately. This executes the next waiter if
+			 * any.
+			 */
+			fetch_received_data(node);
+		}
+		else if (!fsstate->s.connpriv->busy)
+		{
+			/* If the connection is not busy, just send the request. */
+			request_more_data(node);
+		}
+		else
+		{
+			/* This connection is busy */
+			bool available = true;
+			ForeignScanState *leader = fsstate->s.connpriv->leader;
+			PgFdwScanState *leader_state = GetPgFdwScanState(leader);
+
+			/* Check if the result is immediately available */
+			if (PQisBusy(leader_state->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(leader_state->s.conn), 0,
+										   WAIT_EVENT_ASYNC_WAIT);
+				if (!(rc & WL_SOCKET_READABLE))
+					available = false;
+			}
+
+			/* The next waiter is executed automatically */
+			if (available)
+				fetch_received_data(leader);
+
+			/* add the requested node */
+			add_async_waiter(node);
+
+			/* add the previous leader */
+			add_async_waiter(leader);
+		}
+	}
 
 	/*
-	 * Get some more tuples, if we've run out.
+	 * If we haven't received a result for the given node this time,
+	 * return with no tuple to give way to another node.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
-		if (fsstate->next_tuple >= fsstate->num_tuples)
-			return ExecClearTuple(slot);
+		if (fsstate->eof_reached)
+		{
+			fsstate->result_ready = true;
+			node->ss.ps.asyncstate = AS_AVAILABLE;
+		}
+		else
+		{
+			fsstate->result_ready = false;
+			node->ss.ps.asyncstate = AS_WAITING;
+		}
+
+		return ExecClearTuple(slot);
 	}
 
 	/*
 	 * Return the next tuple.
 	 */
+	fsstate->result_ready = true;
+	node->ss.ps.asyncstate = AS_AVAILABLE;
 	ExecStoreTuple(fsstate->tuples[fsstate->next_tuple++],
 				   slot,
 				   InvalidBuffer,
@@ -1457,7 +1727,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1465,6 +1735,8 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	vacate_connection((PgFdwState *)fsstate, true);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1493,9 +1765,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1513,7 +1785,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1521,15 +1793,31 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
+/*
+ * postgresShutdownForeignScan
+ *		Detach the node from the async machinery and clean up the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	/* remove the node from waiting queue */
+	remove_async_node(node);
+}
+
 /*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
@@ -1753,6 +2041,9 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	/* finish running query to send my command */
+	vacate_connection((PgFdwState *)fmstate, true);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1763,14 +2054,14 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1778,10 +2069,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1819,6 +2110,9 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	/* finish running query to send my command */
+	vacate_connection((PgFdwState *)fmstate, true);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1839,14 +2133,14 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1854,10 +2148,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1895,6 +2189,9 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	/* finish running query to send my command */
+	vacate_connection((PgFdwState *)fmstate, true);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1915,14 +2212,14 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1930,10 +2227,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -2400,7 +2697,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 
 	/* Update the foreign-join-related fields. */
 	if (fsplan->scan.scanrelid == 0)
@@ -2485,7 +2784,11 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		/* finish running query to send my command */
+		vacate_connection((PgFdwState *)dmstate, true);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2532,8 +2835,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* close the target relation. */
 	if (dmstate->resultRel)
@@ -2656,6 +2959,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnpriv *connpriv;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2698,6 +3002,18 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connpriv = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnpriv));
+		if (connpriv)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connpriv = connpriv;
+
+			/* finish running query to send my command */
+			vacate_connection(&tmpstate, true);
+		}
+
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -3061,11 +3377,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -3128,50 +3444,127 @@ create_cursor(ForeignScanState *node)
 }
 
 /*
- * Fetch some more rows from the node's cursor.
+ * Sends the next fetch request for the node.  If the given node is not the
+ * current connection leader, the old leader is pushed back onto the waiter
+ * queue and the given node becomes the leader.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *leader = fsstate->s.connpriv->leader;
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* must be non-busy */
+	Assert(!fsstate->s.connpriv->busy);
+	/* must be not-eof */
+	Assert(!fsstate->eof_reached);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connpriv->busy = true;
+
+	/* Let the node be the leader if it is different from current one */
+	if (leader != node)
+	{
+		/*
+		 * If a connection leader already exists, install this node as the
+		 * new leader, making the old leader the first waiter.
+		 */
+		if (leader != NULL)
+		{
+			remove_async_node(node);
+			fsstate->last_waiter = GetPgFdwScanState(leader)->last_waiter;
+			fsstate->waiter = leader;
+		}
+		else
+		{
+			fsstate->last_waiter = node;
+			fsstate->waiter = NULL;
+		}
+
+		fsstate->s.connpriv->leader = node;
+	}
+}
+
+/*
+ * Fetches received data and automatically sends the next waiter's request.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
+	ForeignScanState *waiter;
+
+	/* I should be the current connection leader */
+	Assert(fsstate->s.connpriv->leader == node);
 
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -3181,26 +3574,76 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connpriv->busy = false;
+
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connpriv->busy = false;
+
+	/* let the first waiter be the next leader of this connection */
+	waiter = move_to_next_waiter(node);
+
+	/* send the next request if any */
+	if (waiter)
+		request_more_data(waiter);
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate, bool clear_queue)
+{
+	PgFdwConnpriv *connpriv = fdwstate->connpriv;
+	ForeignScanState *leader;
+
+	/* the connection is already available */
+	if (connpriv == NULL || connpriv->leader == NULL || !connpriv->busy)
+		return;
+
+	/*
+	 * let the current connection leader read the result for the running query
+	 */
+	leader = connpriv->leader;
+	fetch_received_data(leader);
+
+	/* let the first waiter be the next leader of this connection */
+	move_to_next_waiter(leader);
+
+	if (!clear_queue)
+		return;
+
+	/* Clear the waiting list */
+	while (leader)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(leader);
+
+		fsstate->last_waiter = NULL;
+		leader = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3314,7 +3757,9 @@ create_foreign_modify(EState *estate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connpriv = (PgFdwConnpriv *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnpriv));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Set up remote query information. */
@@ -3387,7 +3832,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3397,12 +3842,12 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn, false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3410,9 +3855,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3537,16 +3982,16 @@ finish_foreign_modify(PgFdwModifyState *fmstate)
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -3706,9 +4151,9 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn, false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3716,10 +4161,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -5203,6 +5648,42 @@ postgresGetForeignJoinPaths(PlannerInfo *root,
 	/* XXX Consider parameterized paths for the join relation */
 }
 
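+/*
+ * Report whether the given ForeignPath can be executed asynchronously.
+ * For now this simply claims every postgres_fdw path is async-capable;
+ * per the notes above, the direct-modify paths actually are not, which
+ * is why some regression tests currently fail.
+ */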
+static bool
+postgresIsForeignPathAsyncCapable(ForeignPath *path)
+{
+	return true;
+}
+
+/*
+ * Configure a wait event for this node.
+ *
+ * Add a wait event only when this node is the connection leader; otherwise
+ * another node on this connection is the leader and waits on its behalf.
+ * Returns true if this node has an event in the set, false if not.
+ */
+static bool
+postgresForeignAsyncConfigureWait(ForeignScanState *node, WaitEventSet *wes,
+								  void *caller_data, bool reinit)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	/* If the caller didn't reinit, this event is already in the event set */
+	if (!reinit)
+		return true;
+
+	if (fsstate->s.connpriv->leader == node)
+	{
+		AddWaitEventToSet(wes,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, caller_data);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Assess whether the aggregation, grouping and having operations can be pushed
  * down to the foreign server.  As a side effect, save information we obtain in
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index a5d4011e8d..f344fb7f66 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -77,6 +77,7 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
+	bool		allow_prefetch;	/* true to allow overlapped fetching */
 
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
@@ -116,6 +117,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 231b1e01a5..8ecc903c20 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1617,25 +1617,25 @@ INSERT INTO b(aa) VALUES('bbb');
 INSERT INTO b(aa) VALUES('bbbb');
 INSERT INTO b(aa) VALUES('bbbbb');
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
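+-- (ORDER BY added here and below so the output is stable; with asynchronous
+-- child scans the row order of an inheritance scan is presumably no longer
+-- deterministic)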
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'zzzzzz' WHERE aa LIKE 'aaaa%';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE b SET aa = 'new';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
 UPDATE a SET aa = 'newtoo';
 
-SELECT tableoid::regclass, * FROM a;
+SELECT tableoid::regclass, * FROM a ORDER BY 1, 2;
 SELECT tableoid::regclass, * FROM b;
 SELECT tableoid::regclass, * FROM ONLY a;
 
@@ -1677,12 +1677,12 @@ insert into bar2 values(4,44,44);
 insert into bar2 values(7,77,77);
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for update;
-select * from bar where f1 in (select f1 from foo) for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
+select * from bar where f1 in (select f1 from foo) order by 1 for update;
 
 explain (verbose, costs off)
-select * from bar where f1 in (select f1 from foo) for share;
-select * from bar where f1 in (select f1 from foo) for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
+select * from bar where f1 in (select f1 from foo) order by 1 for share;
 
 -- Check UPDATE with inherited target and an inherited source table
 explain (verbose, costs off)
@@ -1741,8 +1741,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
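+-- (UPDATE ... RETURNING takes no ORDER BY of its own, hence the CTE wrapper
+-- to sort the returned rows)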
 
 -- Test that UPDATE/DELETE with inherited target works with row-level triggers
 CREATE TRIGGER trig_row_before
-- 
2.16.3