asynchronous and vectorized execution

Started by Robert Haas over 9 years ago. 64 messages
#1 Robert Haas
robertmhaas@gmail.com
3 attachment(s)

Hi,

I realize that we haven't gotten 9.6beta1 out the door yet, but I
think we can't really wait much longer to start having at least some
discussion of 9.7 topics, so I'm going to go ahead and put this one
out there. I believe there are other people thinking about these
topics as well, including Andres Freund, Kyotaro Horiguchi, and
probably some folks at 2ndQuadrant (but I don't know exactly who). To
make a long story short, I think there are several different areas
where we should consider major upgrades to our executor. It's too
slow and it doesn't do everything we want it to do. The main things
on my mind are:

1. asynchronous execution, by which I mean the ability of a node to
somehow say that it will generate a tuple eventually, but is not yet
ready, so that the executor can go run some other part of the plan
tree while it waits. This case most obviously arises for foreign
tables, where it makes little sense to block on I/O if some other part
of the query tree could benefit from the CPU; consider SELECT * FROM
lt WHERE qual UNION SELECT * FROM ft WHERE qual. It is also a problem
for parallel query: in a parallel sequential scan, the next worker can
begin reading the next block even if the current block hasn't yet been
received from the OS. Whether or not this will be efficient is a
research question, but it can be done. However, imagine a parallel
scan of a btree index: we don't know what page to scan next until we
read the previous page and examine the next-pointer. In the meantime,
any worker that arrives at that scan node has no choice but to block.
It would be better if the scan node could instead say "hey, thanks for
coming but I'm really not ready to be on-CPU just at the moment" and
potentially allow the worker to go work in some other part of the
query tree. For that worker to actually find useful work to do
elsewhere, we'll probably need either the table to be partitioned or
the original query to involve UNION ALL,
but those are not silly cases to worry about, particularly if we get
native partitioning in 9.7.
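
As a purely illustrative sketch, a caller that understood a not-ready
indication might be shaped like this; ExecStatus, ExecAsyncProcNode,
and choose_runnable_subtree are invented names for the purpose, not
anything that exists or that the attached patches implement:

/* hypothetical sketch only -- none of these names exist today */
typedef enum
{
    EXEC_RETURNED_TUPLE,        /* slot holds a new tuple */
    EXEC_END_OF_TUPLES,         /* this subtree is exhausted */
    EXEC_NOT_READY              /* e.g. blocked on FDW or disk I/O */
} ExecStatus;

TupleTableSlot *slot;

for (;;)
{
    ExecStatus  status = ExecAsyncProcNode(node, &slot);

    if (status == EXEC_NOT_READY)
    {
        /* don't block; make progress on some other subtree instead */
        node = choose_runnable_subtree(root);
        continue;
    }
    if (status == EXEC_END_OF_TUPLES)
        break;
    /* status == EXEC_RETURNED_TUPLE: consume slot */
}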

2. vectorized execution, by which I mean the ability of a node to
return tuples in batches rather than one by one. Andres has opined
more than once that repeated trips through ExecProcNode defeat the
ability of the CPU to do branch prediction correctly, slowing the
whole system down, and that they also result in poor CPU cache
behavior, since we jump all over the place executing a little bit of
code from each node before moving on to the next rather than running
one bit of code first, and then another later. I think that's
probably right. For example, consider a 5-table join where all of
the joins are implemented as hash tables. If this query plan is going
to be run to completion, it would make much more sense to fetch, say,
100 tuples from the driving scan and then probe for all of those in
the first hash table, and then probe for all of those in the second
hash table, and so on. What we do instead is fetch one tuple and
probe for it in all 5 hash tables, and then repeat. If one of those
hash tables would fit in the CPU cache but all five together will not,
that seems likely to be a lot worse. But even just ignoring the CPU
cache aspect of it for a minute, suppose you want to write a loop to
perform a hash join. The inner loop fetches the next tuple from the
probe table and does a hash lookup. Right now, fetching the next
tuple from the probe table means calling a function which in turn
calls another function which probably calls another function which
probably calls another function and now about 4 layers down we
actually get the next tuple. If the scan returned a batch of tuples
to the hash join, fetching the next tuple from the batch would
probably be 0 or 1 function calls rather than ... more. Admittedly,
you've got to consider the cost of marshaling the batches but I'm
optimistic that there are cycles to be squeezed out here. We might
also want to consider storing batches of tuples in a column-optimized
rather than row-optimized format so that iterating through one or two
attributes across every tuple in the batch touches the minimal number
of cache lines.
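
To make the batching idea concrete, the inner loops might look roughly
like this; TupleBatch, FetchBatch, HashProbe, and EmitJoinedTuples are
hypothetical names, and a real implementation would also have to cope
with probes matching zero or several tuples:

#define BATCH_SIZE 100

/* sketch only -- these types and functions are invented */
TupleBatch *batch;

while ((batch = FetchBatch(drivingScan, BATCH_SIZE)) != NULL)
{
    int     i,
            j;

    /*
     * Probe one hash table at a time across the whole batch, so each
     * table's buckets can stay hot in the CPU cache while we're using
     * them, instead of touching all the tables for every tuple.
     */
    for (j = 0; j < njointables; j++)
        for (i = 0; i < batch->ntuples; i++)
            batch->matched[j][i] = HashProbe(hashtable[j],
                                             batch->tuples[i]);

    EmitJoinedTuples(batch);
}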

Obviously, both of these are big projects that could touch a large
amount of executor code, and there may be other ideas, in addition to
these, which some of you may be thinking about that could also touch a
large amount of executor code. It would be nice to agree on a way
forward that minimizes code churn and maximizes everyone's attempt to
contribute without conflicting with each other. Also, it seems
desirable to enable, as far as possible, incremental development - in
particular, it seems to me that it would be good to pick a design that
doesn't require massive changes to every node all at once. A single
patch that adds some capability to every node in the executor in one
fell swoop is going to be too large to review effectively.

My proposal for how to do this is to make ExecProcNode function as a
backward-compatibility wrapper. For asynchronous execution, a node
might return a not-ready-yet indication, but if that node is called
via ExecProcNode, it means the caller isn't prepared to receive such
an indication, so ExecProcNode will just wait for the node to become
ready and then return the tuple. Similarly, for vectorized execution,
a node might return a bunch of tuples all at once. ExecProcNode will
extract the first one and return it to the caller, and subsequent
calls to ExecProcNode will iterate through the rest of the batch, only
calling the underlying node-specific function when the batch is
exhausted. In this way, code that doesn't know about the new stuff
can continue to work pretty much as it does today. Also, and I think
this is important, nodes don't need the permission of their parent
node to use these new capabilities. They can use them whenever they
wish, without worrying about whether the upper node is prepared to
deal with it. If not, ExecProcNode will paper over the problem. This
seems to me to be a good way to keep the code simple.
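
Concretely, the wrapper might be shaped something like the following.
This is only a sketch under assumptions: ExecDispatchNode,
ExecWaitForResult, and the TupleBatch helpers are invented stand-ins
for whatever the real mechanism turns out to be.

/* sketch only -- the helper functions here are invented */
TupleTableSlot *
ExecProcNode(PlanState *node)
{
    /* vectorized case: drain any batch left over from a prior call */
    if (node->pending_batch != NULL)
        return NextTupleFromBatch(node->pending_batch);

    node->result_ready = false;
    ExecDispatchNode(node);     /* calls ExecSeqScan, ExecAgg, etc. */

    /* asynchronous case: this caller can't cope, so just wait */
    while (!node->result_ready)
        ExecWaitForResult(node);

    /* vectorized case: peel off and return the batch's first tuple */
    if (node->result != NULL && IsA(node->result, TupleBatch))
    {
        node->pending_batch = (TupleBatch *) node->result;
        return NextTupleFromBatch(node->pending_batch);
    }

    return (TupleTableSlot *) node->result;
}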

For asynchronous execution, I have gone so far as to mock up a bit of
what this might look like. This shouldn't be taken very seriously at
this point, but I'm attaching a few very-much-WIP patches to show the
direction of my line of thinking. Basically, I propose to have
ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
by putting them into a new PlanState member called "result", which is
just a Node * so that we can support multiple types of results,
instead of returning them. There is also a result_ready boolean, so
that a node can return without setting this Boolean to engage
asynchronous behavior. This triggers an "event loop", which
repeatedly waits for FDs chosen by waiting nodes to become readable
and/or writeable and then gives the node a chance to react.
Eventually, the waiting node will stop waiting and have a result
ready, at which point the event loop will give the parent of that node
a chance to run. If that node consequently becomes ready, then its
parent gets a chance to run. Eventually (we hope), the node for which
we're waiting becomes ready, and we can then read a result tuple.
With some more work, this seems like it can handle the FDW case, but I
haven't worked out how to make it handle the related parallel query
case. What we want there is to wait not for the readiness of an FD
but rather for some other process involved in the parallel query to
reach a point where it can welcome assistance executing that node. I
don't know exactly what the signaling for that should look like yet -
maybe setting the process latch or something.
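
Very roughly, and with every name below invented for the purpose, the
loop might look like this, leaning on the parent pointer that patch
0001 adds:

/* sketch only -- run the loop until the node we need has a result */
while (!target->result_ready)
{
    /* block until some waiting node's FD is readable or writeable */
    PlanState  *node = ExecWaitForAnyFD(waiting_nodes);

    ExecDispatchNode(node);     /* let the node react to the event */

    /* as results appear, walk up, giving each parent a chance to run */
    while (node->result_ready && node != target)
    {
        node = node->parent;
        node->result_ready = false;
        ExecDispatchNode(node);
    }
}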

By the way, one smaller executor project that I think we should also
look at has to do with this comment in nodeSeqScan.c:

static bool
SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
{
/*
* Note that unlike IndexScan, SeqScan never use keys in heap_beginscan
* (and this is very bad) - so, here we do not check are keys ok or not.
*/
return true;
}

Some quick prototyping by my colleague Dilip Kumar suggests that, in
fact, there are cases where pushing down keys into heap_beginscan()
could be significantly faster. Some care is required here because any
functions we execute as scan keys are run with the buffer locked, so
we had better not run anything very complicated. But doing this for
simple things like integer equality operators seems like it could save
quite a few buffer lock/unlock cycles and some other executor overhead
as well.
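
For example, pushing a qual like "x = 42" on an int4 column down into
the scan is mostly a matter of building a scan key and handing it to
heap_beginscan(), which already applies keys via HeapKeyTest while the
buffer is locked. The sketch below omits all the planner and executor
plumbing needed to decide when this is safe:

ScanKeyData key;
HeapScanDesc scan;

/* sketch: filter "x = 42" inside the heap scan itself */
ScanKeyInit(&key,
            attnum,                     /* column number of "x" */
            BTEqualStrategyNumber,      /* equality */
            F_INT4EQ,                   /* int4eq comparison proc */
            Int32GetDatum(42));         /* constant from the qual */

scan = heap_beginscan(relation, snapshot, 1, &key);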

Thoughts, ideas, suggestions, etc. very welcome.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

0001-Modify-PlanState-to-include-a-pointer-to-the-parent-.patch (text/x-diff; charset=US-ASCII)
From 905bb2c9a9e025f7cd0b5bd75e735f6e8f69f3cf Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 4 May 2016 12:19:03 -0400
Subject: [PATCH 1/3] Modify PlanState to include a pointer to the parent
 PlanState.

---
 src/backend/executor/execMain.c           | 22 ++++++++++++++--------
 src/backend/executor/execProcnode.c       |  5 ++++-
 src/backend/executor/nodeAgg.c            |  3 ++-
 src/backend/executor/nodeAppend.c         |  3 ++-
 src/backend/executor/nodeBitmapAnd.c      |  3 ++-
 src/backend/executor/nodeBitmapHeapscan.c |  3 ++-
 src/backend/executor/nodeBitmapOr.c       |  3 ++-
 src/backend/executor/nodeForeignscan.c    |  2 +-
 src/backend/executor/nodeGather.c         |  3 ++-
 src/backend/executor/nodeGroup.c          |  3 ++-
 src/backend/executor/nodeHash.c           |  3 ++-
 src/backend/executor/nodeHashjoin.c       |  6 ++++--
 src/backend/executor/nodeLimit.c          |  3 ++-
 src/backend/executor/nodeLockRows.c       |  3 ++-
 src/backend/executor/nodeMaterial.c       |  3 ++-
 src/backend/executor/nodeMergeAppend.c    |  3 ++-
 src/backend/executor/nodeMergejoin.c      |  4 +++-
 src/backend/executor/nodeModifyTable.c    |  3 ++-
 src/backend/executor/nodeNestloop.c       |  6 ++++--
 src/backend/executor/nodeRecursiveunion.c |  6 ++++--
 src/backend/executor/nodeResult.c         |  3 ++-
 src/backend/executor/nodeSetOp.c          |  3 ++-
 src/backend/executor/nodeSort.c           |  3 ++-
 src/backend/executor/nodeSubplan.c        |  1 +
 src/backend/executor/nodeSubqueryscan.c   |  3 ++-
 src/backend/executor/nodeUnique.c         |  3 ++-
 src/backend/executor/nodeWindowAgg.c      |  3 ++-
 src/include/executor/executor.h           |  3 ++-
 src/include/nodes/execnodes.h             |  2 ++
 29 files changed, 77 insertions(+), 37 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ac02304..e0d0296 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -923,7 +923,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
-	 * ExecInitSubPlan expects to be able to find these entries.
+	 * ExecInitSubPlan expects to be able to find these entries. Since the
+	 * main plan tree hasn't been initialized yet, we have to pass NULL as the
+	 * parent node to ExecInitNode; ExecInitSubPlan also takes responsibility
+	 * for fixing up subplanstate->parent.
 	 */
 	Assert(estate->es_subplanstates == NIL);
 	i = 1;						/* subplan indices count from 1 */
@@ -943,7 +946,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 		if (bms_is_member(i, plannedstmt->rewindPlanIDs))
 			sp_eflags |= EXEC_FLAG_REWIND;
 
-		subplanstate = ExecInitNode(subplan, estate, sp_eflags);
+		subplanstate = ExecInitNode(subplan, estate, NULL, sp_eflags);
 
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
@@ -954,9 +957,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/*
 	 * Initialize the private state information for all the nodes in the query
 	 * tree.  This opens files, allocates storage and leaves us ready to start
-	 * processing tuples.
+	 * processing tuples.  This is the root planstate node; it has no parent.
 	 */
-	planstate = ExecInitNode(plan, estate, eflags);
+	planstate = ExecInitNode(plan, estate, NULL, eflags);
 
 	/*
 	 * Get the tuple descriptor describing the type of tuples to return.
@@ -2841,7 +2844,9 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * ExecInitSubPlan expects to be able to find these entries. Some of the
 	 * SubPlans might not be used in the part of the plan tree we intend to
 	 * run, but since it's not easy to tell which, we just initialize them
-	 * all.
+	 * all.  Since the main plan tree hasn't been initialized yet, we have to
+	 * pass NULL as the parent node to ExecInitNode; ExecInitSubPlan also
+	 * takes responsibility for fixing up subplanstate->parent.
 	 */
 	Assert(estate->es_subplanstates == NIL);
 	foreach(l, parentestate->es_plannedstmt->subplans)
@@ -2849,7 +2854,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 		Plan	   *subplan = (Plan *) lfirst(l);
 		PlanState  *subplanstate;
 
-		subplanstate = ExecInitNode(subplan, estate, 0);
+		subplanstate = ExecInitNode(subplan, estate, NULL, 0);
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
 	}
@@ -2857,9 +2862,10 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	/*
 	 * Initialize the private state information for all the nodes in the part
 	 * of the plan tree we need to run.  This opens files, allocates storage
-	 * and leaves us ready to start processing tuples.
+	 * and leaves us ready to start processing tuples.  This is the root plan
+	 * node; it has no parent.
 	 */
-	epqstate->planstate = ExecInitNode(planTree, estate, 0);
+	epqstate->planstate = ExecInitNode(planTree, estate, NULL, 0);
 
 	MemoryContextSwitchTo(oldcontext);
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..680ca4b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -133,7 +133,7 @@
  * ------------------------------------------------------------------------
  */
 PlanState *
-ExecInitNode(Plan *node, EState *estate, int eflags)
+ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 {
 	PlanState  *result;
 	List	   *subps;
@@ -340,6 +340,9 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 			break;
 	}
 
+	/* Set parent pointer. */
+	result->parent = parent;
+
 	/*
 	 * Initialize any initPlans present in this node.  The planner put them in
 	 * a separate list for us.
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 0c1e4a3..e37551e 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2448,7 +2448,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	if (node->aggstrategy == AGG_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
 	outerPlan = outerPlan(node);
-	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(aggstate) =
+		ExecInitNode(outerPlan, estate, &aggstate->ss.ps, eflags);
 
 	/*
 	 * initialize source tuple type.
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..beb4ab8 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -165,7 +165,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		appendplanstates[i] = ExecInitNode(initNode, estate, &appendstate->ps,
+										   eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index c39d790..6405fa4 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -81,7 +81,8 @@ ExecInitBitmapAnd(BitmapAnd *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate,
+										   &bitmapandstate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..2ba5cd0 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -646,7 +646,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	 * relation's indexes, and we want to be sure we have acquired a lock on
 	 * the relation first.
 	 */
-	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate,
+											 &scanstate->ss.ps, eflags);
 
 	/*
 	 * all done.
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 7e928eb..faa3a37 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -82,7 +82,8 @@ ExecInitBitmapOr(BitmapOr *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate,
+										   &bitmaporstate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 300f947..8418c5a 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -224,7 +224,7 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	/* Initialize any outer plan. */
 	if (outerPlan(node))
 		outerPlanState(scanstate) =
-			ExecInitNode(outerPlan(node), estate, eflags);
+			ExecInitNode(outerPlan(node), estate, &scanstate->ss.ps, eflags);
 
 	/*
 	 * Tell the FDW to initialize the scan.
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 3834ed6..2ac0c8d 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -97,7 +97,8 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	 * now initialize outer plan
 	 */
 	outerNode = outerPlan(node);
-	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
+	outerPlanState(gatherstate) =
+		ExecInitNode(outerNode, estate, &gatherstate->ps, eflags);
 
 	gatherstate->ps.ps_TupFromTlist = false;
 
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index dcf5175..3c066fc 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -233,7 +233,8 @@ ExecInitGroup(Group *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(grpstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(grpstate) =
+		ExecInitNode(outerPlan(node), estate, &grpstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 9ed09a7..5e78de0 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -200,7 +200,8 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(hashstate) =
+		ExecInitNode(outerPlan(node), estate, &hashstate->ps, eflags);
 
 	/*
 	 * initialize tuple type. no need to initialize projection info because
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 369e666..a7a908a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -486,8 +486,10 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	outerNode = outerPlan(node);
 	hashNode = (Hash *) innerPlan(node);
 
-	outerPlanState(hjstate) = ExecInitNode(outerNode, estate, eflags);
-	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
+	outerPlanState(hjstate) =
+		ExecInitNode(outerNode, estate, &hjstate->js.ps, eflags);
+	innerPlanState(hjstate) =
+		ExecInitNode((Plan *) hashNode, estate, &hjstate->js.ps, eflags);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index faf32e1..97267c5 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -412,7 +412,8 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	 * then initialize outer plan
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(limitstate) =
+		ExecInitNode(outerPlan, estate, &limitstate->ps, eflags);
 
 	/*
 	 * limit nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 4ebcaff..c4b5333 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -376,7 +376,8 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(lrstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(lrstate) =
+		ExecInitNode(outerPlan, estate, &lrstate->ps, eflags);
 
 	/*
 	 * LockRows nodes do no projections, so initialize projection info for
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9ab03f3..82e31c1 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -219,7 +219,8 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
 	outerPlan = outerPlan(node);
-	outerPlanState(matstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(matstate) =
+		ExecInitNode(outerPlan, estate, &matstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e271927..ae0e8dc 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -112,7 +112,8 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		mergeplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		mergeplanstates[i] =
+			ExecInitNode(initNode, estate, &mergestate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 6db09b8..cd8d6c6 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1522,8 +1522,10 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	 *
 	 * inner child must support MARK/RESTORE.
 	 */
-	outerPlanState(mergestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(mergestate) =
+		ExecInitNode(outerPlan(node), estate, &mergestate->js.ps, eflags);
 	innerPlanState(mergestate) = ExecInitNode(innerPlan(node), estate,
+											  &mergestate->js.ps,
 											  eflags | EXEC_FLAG_MARK);
 
 	/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e62c8aa..7bb318a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1618,7 +1618,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
-		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
+		mtstate->mt_plans[i] =
+			ExecInitNode(subplan, estate, &mtstate->ps, eflags);
 
 		/* Also let FDWs init themselves for foreign-table result rels */
 		if (!resultRelInfo->ri_usesFdwDirectModify &&
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 555fa09..1895b60 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -340,12 +340,14 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 * inner child, because it will always be rescanned with fresh parameter
 	 * values.
 	 */
-	outerPlanState(nlstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(nlstate) =
+		ExecInitNode(outerPlan(node), estate, &nlstate->js.ps, eflags);
 	if (node->nestParams == NIL)
 		eflags |= EXEC_FLAG_REWIND;
 	else
 		eflags &= ~EXEC_FLAG_REWIND;
-	innerPlanState(nlstate) = ExecInitNode(innerPlan(node), estate, eflags);
+	innerPlanState(nlstate) =
+		ExecInitNode(innerPlan(node), estate, &nlstate->js.ps, eflags);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index e76405a..2328ef3 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -245,8 +245,10 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(rustate) = ExecInitNode(outerPlan(node), estate, eflags);
-	innerPlanState(rustate) = ExecInitNode(innerPlan(node), estate, eflags);
+	outerPlanState(rustate) =
+		ExecInitNode(outerPlan(node), estate, &rustate->ps, eflags);
+	innerPlanState(rustate) =
+		ExecInitNode(innerPlan(node), estate, &rustate->ps, eflags);
 
 	/*
 	 * If hashing, precompute fmgr lookup data for inner loop, and create the
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 4007b76..0d2de14 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -250,7 +250,8 @@ ExecInitResult(Result *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(resstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(resstate) =
+		ExecInitNode(outerPlan(node), estate, &resstate->ps, eflags);
 
 	/*
 	 * we don't use inner plan
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 2d81d46..7a3b67c 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -537,7 +537,8 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 	 */
 	if (node->strategy == SETOP_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
-	outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(setopstate) =
+		ExecInitNode(outerPlan(node), estate, &setopstate->ps, eflags);
 
 	/*
 	 * setop nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index a34dcc5..0286a7f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -199,7 +199,8 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	 */
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
-	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(sortstate) =
+		ExecInitNode(outerPlan(node), estate, &sortstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index e503494..458e254 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -707,6 +707,7 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent)
 
 	/* ... and to its parent's state */
 	sstate->parent = parent;
+	sstate->planstate->parent = parent;
 
 	/* Initialize subexpressions */
 	sstate->testexpr = ExecInitExpr((Expr *) subplan->testexpr, parent);
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 0304b15..75a28fd 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -144,7 +144,8 @@ ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags)
 	/*
 	 * initialize subquery
 	 */
-	subquerystate->subplan = ExecInitNode(node->subplan, estate, eflags);
+	subquerystate->subplan =
+		ExecInitNode(node->subplan, estate, &subquerystate->ss.ps, eflags);
 
 	subquerystate->ss.ps.ps_TupFromTlist = false;
 
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 4caae34..5d13a89 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -145,7 +145,8 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(uniquestate) =
+		ExecInitNode(outerPlan(node), estate, &uniquestate->ps, eflags);
 
 	/*
 	 * unique nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index f06eebe..3dc6757 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1844,7 +1844,8 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	 * initialize child nodes
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(winstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(winstate) =
+		ExecInitNode(outerPlan, estate, &winstate->ss.ps, eflags);
 
 	/*
 	 * initialize source tuple type (which is also the tuple type that we'll
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 44fac27..f1be8fa 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -221,7 +221,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
 /*
  * prototypes from functions in execProcnode.c
  */
-extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
+extern PlanState *ExecInitNode(Plan *node, EState *estate, PlanState *parent,
+			 int eflags);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ee4e189..7d33b6d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1030,6 +1030,8 @@ typedef struct PlanState
 								 * nodes point to one EState for the whole
 								 * top-level plan */
 
+	struct PlanState *parent;	/* node which will receive tuples from us */
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
 
-- 
2.5.4 (Apple Git-61)

0002-Modify-PlanState-to-have-result-result_ready-fields.patch (text/x-diff; charset=US-ASCII)
From 1c8d3fa278824ba83043aa106d35419f6824ab3f Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 6 May 2016 13:01:48 -0400
Subject: [PATCH 2/3] Modify PlanState to have result/result_ready fields.
 Modify executor to use them instead of returning tuples directly.

---
 src/backend/executor/execProcnode.c       | 75 ++++++++++++++++++-------------
 src/backend/executor/execScan.c           | 26 +++++++----
 src/backend/executor/nodeAgg.c            | 13 +++---
 src/backend/executor/nodeAppend.c         | 11 +++--
 src/backend/executor/nodeBitmapHeapscan.c |  2 +-
 src/backend/executor/nodeCtescan.c        |  2 +-
 src/backend/executor/nodeCustom.c         |  4 +-
 src/backend/executor/nodeForeignscan.c    |  2 +-
 src/backend/executor/nodeFunctionscan.c   |  2 +-
 src/backend/executor/nodeGather.c         | 17 ++++---
 src/backend/executor/nodeGroup.c          | 24 +++++++---
 src/backend/executor/nodeHash.c           |  3 +-
 src/backend/executor/nodeHashjoin.c       | 29 ++++++++----
 src/backend/executor/nodeIndexonlyscan.c  |  2 +-
 src/backend/executor/nodeIndexscan.c      |  2 +-
 src/backend/executor/nodeLimit.c          | 42 ++++++++++++-----
 src/backend/executor/nodeLockRows.c       |  9 ++--
 src/backend/executor/nodeMaterial.c       | 21 ++++++---
 src/backend/executor/nodeMergeAppend.c    |  4 +-
 src/backend/executor/nodeMergejoin.c      | 74 ++++++++++++++++++++++--------
 src/backend/executor/nodeModifyTable.c    | 15 ++++---
 src/backend/executor/nodeNestloop.c       | 16 ++++---
 src/backend/executor/nodeRecursiveunion.c | 10 +++--
 src/backend/executor/nodeResult.c         | 20 ++++++---
 src/backend/executor/nodeSamplescan.c     |  2 +-
 src/backend/executor/nodeSeqscan.c        |  2 +-
 src/backend/executor/nodeSetOp.c          | 14 +++---
 src/backend/executor/nodeSort.c           |  4 +-
 src/backend/executor/nodeSubqueryscan.c   |  2 +-
 src/backend/executor/nodeTidscan.c        |  2 +-
 src/backend/executor/nodeUnique.c         |  8 ++--
 src/backend/executor/nodeValuesscan.c     |  2 +-
 src/backend/executor/nodeWindowAgg.c      | 17 ++++---
 src/backend/executor/nodeWorktablescan.c  |  2 +-
 src/include/executor/executor.h           | 11 ++++-
 src/include/executor/nodeAgg.h            |  2 +-
 src/include/executor/nodeAppend.h         |  2 +-
 src/include/executor/nodeBitmapHeapscan.h |  2 +-
 src/include/executor/nodeCtescan.h        |  2 +-
 src/include/executor/nodeCustom.h         |  2 +-
 src/include/executor/nodeForeignscan.h    |  2 +-
 src/include/executor/nodeFunctionscan.h   |  2 +-
 src/include/executor/nodeGather.h         |  2 +-
 src/include/executor/nodeGroup.h          |  2 +-
 src/include/executor/nodeHash.h           |  2 +-
 src/include/executor/nodeHashjoin.h       |  2 +-
 src/include/executor/nodeIndexonlyscan.h  |  2 +-
 src/include/executor/nodeIndexscan.h      |  2 +-
 src/include/executor/nodeLimit.h          |  2 +-
 src/include/executor/nodeLockRows.h       |  2 +-
 src/include/executor/nodeMaterial.h       |  2 +-
 src/include/executor/nodeMergeAppend.h    |  2 +-
 src/include/executor/nodeMergejoin.h      |  2 +-
 src/include/executor/nodeModifyTable.h    |  2 +-
 src/include/executor/nodeNestloop.h       |  2 +-
 src/include/executor/nodeRecursiveunion.h |  2 +-
 src/include/executor/nodeResult.h         |  2 +-
 src/include/executor/nodeSamplescan.h     |  2 +-
 src/include/executor/nodeSeqscan.h        |  2 +-
 src/include/executor/nodeSetOp.h          |  2 +-
 src/include/executor/nodeSort.h           |  2 +-
 src/include/executor/nodeSubqueryscan.h   |  2 +-
 src/include/executor/nodeTidscan.h        |  2 +-
 src/include/executor/nodeUnique.h         |  2 +-
 src/include/executor/nodeValuesscan.h     |  2 +-
 src/include/executor/nodeWindowAgg.h      |  2 +-
 src/include/executor/nodeWorktablescan.h  |  2 +-
 src/include/nodes/execnodes.h             |  2 +
 68 files changed, 360 insertions(+), 197 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 680ca4b..3f2ebff 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -380,6 +380,9 @@ ExecProcNode(PlanState *node)
 
 	CHECK_FOR_INTERRUPTS();
 
+	/* mark any previous result as having been consumed */
+	node->result_ready = false;
+
 	if (node->chgParam != NULL) /* something changed */
 		ExecReScan(node);		/* let ReScan handle this */
 
@@ -392,23 +395,23 @@ ExecProcNode(PlanState *node)
 			 * control nodes
 			 */
 		case T_ResultState:
-			result = ExecResult((ResultState *) node);
+			ExecResult((ResultState *) node);
 			break;
 
 		case T_ModifyTableState:
-			result = ExecModifyTable((ModifyTableState *) node);
+			ExecModifyTable((ModifyTableState *) node);
 			break;
 
 		case T_AppendState:
-			result = ExecAppend((AppendState *) node);
+			ExecAppend((AppendState *) node);
 			break;
 
 		case T_MergeAppendState:
-			result = ExecMergeAppend((MergeAppendState *) node);
+			ExecMergeAppend((MergeAppendState *) node);
 			break;
 
 		case T_RecursiveUnionState:
-			result = ExecRecursiveUnion((RecursiveUnionState *) node);
+			ExecRecursiveUnion((RecursiveUnionState *) node);
 			break;
 
 			/* BitmapAndState does not yield tuples */
@@ -419,119 +422,119 @@ ExecProcNode(PlanState *node)
 			 * scan nodes
 			 */
 		case T_SeqScanState:
-			result = ExecSeqScan((SeqScanState *) node);
+			ExecSeqScan((SeqScanState *) node);
 			break;
 
 		case T_SampleScanState:
-			result = ExecSampleScan((SampleScanState *) node);
+			ExecSampleScan((SampleScanState *) node);
 			break;
 
 		case T_IndexScanState:
-			result = ExecIndexScan((IndexScanState *) node);
+			ExecIndexScan((IndexScanState *) node);
 			break;
 
 		case T_IndexOnlyScanState:
-			result = ExecIndexOnlyScan((IndexOnlyScanState *) node);
+			ExecIndexOnlyScan((IndexOnlyScanState *) node);
 			break;
 
 			/* BitmapIndexScanState does not yield tuples */
 
 		case T_BitmapHeapScanState:
-			result = ExecBitmapHeapScan((BitmapHeapScanState *) node);
+			ExecBitmapHeapScan((BitmapHeapScanState *) node);
 			break;
 
 		case T_TidScanState:
-			result = ExecTidScan((TidScanState *) node);
+			ExecTidScan((TidScanState *) node);
 			break;
 
 		case T_SubqueryScanState:
-			result = ExecSubqueryScan((SubqueryScanState *) node);
+			ExecSubqueryScan((SubqueryScanState *) node);
 			break;
 
 		case T_FunctionScanState:
-			result = ExecFunctionScan((FunctionScanState *) node);
+			ExecFunctionScan((FunctionScanState *) node);
 			break;
 
 		case T_ValuesScanState:
-			result = ExecValuesScan((ValuesScanState *) node);
+			ExecValuesScan((ValuesScanState *) node);
 			break;
 
 		case T_CteScanState:
-			result = ExecCteScan((CteScanState *) node);
+			ExecCteScan((CteScanState *) node);
 			break;
 
 		case T_WorkTableScanState:
-			result = ExecWorkTableScan((WorkTableScanState *) node);
+			ExecWorkTableScan((WorkTableScanState *) node);
 			break;
 
 		case T_ForeignScanState:
-			result = ExecForeignScan((ForeignScanState *) node);
+			ExecForeignScan((ForeignScanState *) node);
 			break;
 
 		case T_CustomScanState:
-			result = ExecCustomScan((CustomScanState *) node);
+			ExecCustomScan((CustomScanState *) node);
 			break;
 
 			/*
 			 * join nodes
 			 */
 		case T_NestLoopState:
-			result = ExecNestLoop((NestLoopState *) node);
+			ExecNestLoop((NestLoopState *) node);
 			break;
 
 		case T_MergeJoinState:
-			result = ExecMergeJoin((MergeJoinState *) node);
+			ExecMergeJoin((MergeJoinState *) node);
 			break;
 
 		case T_HashJoinState:
-			result = ExecHashJoin((HashJoinState *) node);
+			ExecHashJoin((HashJoinState *) node);
 			break;
 
 			/*
 			 * materialization nodes
 			 */
 		case T_MaterialState:
-			result = ExecMaterial((MaterialState *) node);
+			ExecMaterial((MaterialState *) node);
 			break;
 
 		case T_SortState:
-			result = ExecSort((SortState *) node);
+			ExecSort((SortState *) node);
 			break;
 
 		case T_GroupState:
-			result = ExecGroup((GroupState *) node);
+			ExecGroup((GroupState *) node);
 			break;
 
 		case T_AggState:
-			result = ExecAgg((AggState *) node);
+			ExecAgg((AggState *) node);
 			break;
 
 		case T_WindowAggState:
-			result = ExecWindowAgg((WindowAggState *) node);
+			ExecWindowAgg((WindowAggState *) node);
 			break;
 
 		case T_UniqueState:
-			result = ExecUnique((UniqueState *) node);
+			ExecUnique((UniqueState *) node);
 			break;
 
 		case T_GatherState:
-			result = ExecGather((GatherState *) node);
+			ExecGather((GatherState *) node);
 			break;
 
 		case T_HashState:
-			result = ExecHash((HashState *) node);
+			ExecHash((HashState *) node);
 			break;
 
 		case T_SetOpState:
-			result = ExecSetOp((SetOpState *) node);
+			ExecSetOp((SetOpState *) node);
 			break;
 
 		case T_LockRowsState:
-			result = ExecLockRows((LockRowsState *) node);
+			ExecLockRows((LockRowsState *) node);
 			break;
 
 		case T_LimitState:
-			result = ExecLimit((LimitState *) node);
+			ExecLimit((LimitState *) node);
 			break;
 
 		default:
@@ -540,6 +543,14 @@ ExecProcNode(PlanState *node)
 			break;
 	}
 
+	/* We don't support asynchronous execution yet. */
+	Assert(node->result_ready);
+
+	/* Result should be a TupleTableSlot, unless it's NULL. */
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+
+	result = (TupleTableSlot *) node->result;
+
 	if (node->instrument)
 		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index fb0013d..095d40b 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -99,7 +99,7 @@ ExecScanFetch(ScanState *node,
  *		ExecScan
  *
  *		Scans the relation using the 'access method' indicated and
- *		returns the next qualifying tuple in the direction specified
+ *		produces the next qualifying tuple in the direction specified
  *		in the global variable ExecDirection.
  *		The access method returns the next tuple and ExecScan() is
  *		responsible for checking the tuple returned against the qual-clause.
@@ -117,7 +117,7 @@ ExecScanFetch(ScanState *node,
  *			 "cursor" is positioned before the first qualifying tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecScan(ScanState *node,
 		 ExecScanAccessMtd accessMtd,	/* function returning a tuple */
 		 ExecScanRecheckMtd recheckMtd)
@@ -137,12 +137,14 @@ ExecScan(ScanState *node,
 
 	/*
 	 * If we have neither a qual to check nor a projection to do, just skip
-	 * all the overhead and return the raw scan tuple.
+	 * all the overhead and produce the raw scan tuple.
 	 */
 	if (!qual && !projInfo)
 	{
 		ResetExprContext(econtext);
-		return ExecScanFetch(node, accessMtd, recheckMtd);
+		ExecReturnTuple(&node->ps,
+						ExecScanFetch(node, accessMtd, recheckMtd));
+		return;
 	}
 
 	/*
@@ -155,7 +157,10 @@ ExecScan(ScanState *node,
 		Assert(projInfo);		/* can't get here if not projecting */
 		resultSlot = ExecProject(projInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -188,9 +193,10 @@ ExecScan(ScanState *node,
 		if (TupIsNull(slot))
 		{
 			if (projInfo)
-				return ExecClearTuple(projInfo->pi_slot);
+				ExecReturnTuple(&node->ps, ExecClearTuple(projInfo->pi_slot));
 			else
-				return slot;
+				ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 
 		/*
@@ -221,7 +227,8 @@ ExecScan(ScanState *node,
 				if (isDone != ExprEndResult)
 				{
 					node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-					return resultSlot;
+					ExecReturnTuple(&node->ps, resultSlot);
+					return;
 				}
 			}
 			else
@@ -229,7 +236,8 @@ ExecScan(ScanState *node,
 				/*
 				 * Here, we aren't projecting, so just return scan tuple.
 				 */
-				return slot;
+				ExecReturnTuple(&node->ps, slot);
+				return;
 			}
 		}
 		else
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index e37551e..b23065d 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1816,7 +1816,7 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
  *	  stored in the expression context to be used when ExecProject evaluates
  *	  the result tuple.
  */
-TupleTableSlot *
+void
 ExecAgg(AggState *node)
 {
 	TupleTableSlot *result;
@@ -1832,7 +1832,10 @@ ExecAgg(AggState *node)
 
 		result = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ss.ps.ps_TupFromTlist = false;
 	}
@@ -1842,6 +1845,7 @@ ExecAgg(AggState *node)
 	 * agg_done gets set before we emit the final aggregate tuple, and we have
 	 * to finish running SRFs for it.)
 	 */
+	result = NULL;
 	if (!node->agg_done)
 	{
 		/* Dispatch based on strategy */
@@ -1856,12 +1860,9 @@ ExecAgg(AggState *node)
 				result = agg_retrieve_direct(node);
 				break;
 		}
-
-		if (!TupIsNull(result))
-			return result;
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ss.ps, result);
 }
 
 /*
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index beb4ab8..e0ce8c6 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -191,7 +191,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
  *		Handles iteration over multiple subplans.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecAppend(AppendState *node)
 {
 	for (;;)
@@ -216,7 +216,8 @@ ExecAppend(AppendState *node)
 			 * NOT make use of the result slot that was set up in
 			 * ExecInitAppend; there's no need for it.
 			 */
-			return result;
+			ExecReturnTuple(&node->ps, result);
+			return;
 		}
 
 		/*
@@ -229,7 +230,11 @@ ExecAppend(AppendState *node)
 		else
 			node->as_whichplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			ExecReturnTuple(&node->ps,
+							ExecClearTuple(node->ps.ps_ResultTupleSlot));
+			return;
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2ba5cd0..31133ff 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -434,7 +434,7 @@ BitmapHeapRecheck(BitmapHeapScanState *node, TupleTableSlot *slot)
  *		ExecBitmapHeapScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecBitmapHeapScan(BitmapHeapScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 3c2f684..1f1fdf5 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -149,7 +149,7 @@ CteScanRecheck(CteScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecCteScan(CteScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeCustom.c b/src/backend/executor/nodeCustom.c
index 322abca..7162348 100644
--- a/src/backend/executor/nodeCustom.c
+++ b/src/backend/executor/nodeCustom.c
@@ -107,11 +107,11 @@ ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
 	return css;
 }
 
-TupleTableSlot *
+void
 ExecCustomScan(CustomScanState *node)
 {
 	Assert(node->methods->ExecCustomScan != NULL);
-	return node->methods->ExecCustomScan(node);
+	ExecReturnTuple(&node->ss.ps, node->methods->ExecCustomScan(node));
 }
 
 void
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 8418c5a..13f0c3a 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -113,7 +113,7 @@ ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecForeignScan(ForeignScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index a03f6e7..3cccd8f 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -262,7 +262,7 @@ FunctionRecheck(FunctionScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecFunctionScan(FunctionScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 2ac0c8d..a4d3a16 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -126,7 +126,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
  *		the next qualifying tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecGather(GatherState *node)
 {
 	TupleTableSlot *fslot = node->funnel_slot;
@@ -207,7 +207,10 @@ ExecGather(GatherState *node)
 	{
 		resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -229,7 +232,10 @@ ExecGather(GatherState *node)
 		 */
 		slot = gather_getnext(node);
 		if (TupIsNull(slot))
-			return NULL;
+		{
+			ExecReturnTuple(&node->ps, NULL);
+			return;
+		}
 
 		/*
 		 * form the result tuple using ExecProject(), and return it --- unless
@@ -242,11 +248,12 @@ ExecGather(GatherState *node)
 		if (isDone != ExprEndResult)
 		{
 			node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-			return resultSlot;
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
 		}
 	}
 
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index 3c066fc..f33a316 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -31,7 +31,7 @@
  *
  *		Return one tuple for each group of matching input tuples.
  */
-TupleTableSlot *
+void
 ExecGroup(GroupState *node)
 {
 	ExprContext *econtext;
@@ -44,7 +44,10 @@ ExecGroup(GroupState *node)
 	 * get state info from node
 	 */
 	if (node->grp_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ss.ps, NULL);
+		return;
+	}
 	econtext = node->ss.ps.ps_ExprContext;
 	numCols = ((Group *) node->ss.ps.plan)->numCols;
 	grpColIdx = ((Group *) node->ss.ps.plan)->grpColIdx;
@@ -61,7 +64,10 @@ ExecGroup(GroupState *node)
 
 		result = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ss.ps.ps_TupFromTlist = false;
 	}
@@ -87,7 +93,8 @@ ExecGroup(GroupState *node)
 		{
 			/* empty input, so return nothing */
 			node->grp_done = TRUE;
-			return NULL;
+			ExecReturnTuple(&node->ss.ps, NULL);
+			return;
 		}
 		/* Copy tuple into firsttupleslot */
 		ExecCopySlot(firsttupleslot, outerslot);
@@ -115,7 +122,8 @@ ExecGroup(GroupState *node)
 			if (isDone != ExprEndResult)
 			{
 				node->ss.ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-				return result;
+				ExecReturnTuple(&node->ss.ps, result);
+				return;
 			}
 		}
 		else
@@ -139,7 +147,8 @@ ExecGroup(GroupState *node)
 			{
 				/* no more groups, so we're done */
 				node->grp_done = TRUE;
-				return NULL;
+				ExecReturnTuple(&node->ss.ps, NULL);
+				return;
 			}
 
 			/*
@@ -178,7 +187,8 @@ ExecGroup(GroupState *node)
 			if (isDone != ExprEndResult)
 			{
 				node->ss.ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-				return result;
+				ExecReturnTuple(&node->ss.ps, result);
+				return;
 			}
 		}
 		else
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 5e78de0..905eb30 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -56,11 +56,10 @@ static void *dense_alloc(HashJoinTable hashtable, Size size);
  *		stub for pro forma compliance
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecHash(HashState *node)
 {
 	elog(ERROR, "Hash node does not support ExecProcNode call convention");
-	return NULL;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index a7a908a..cc92fc3 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -58,7 +58,7 @@ static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
  *			  the other one is "outer".
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecHashJoin(HashJoinState *node)
 {
 	PlanState  *outerNode;
@@ -93,7 +93,10 @@ ExecHashJoin(HashJoinState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -155,7 +158,8 @@ ExecHashJoin(HashJoinState *node)
 					if (TupIsNull(node->hj_FirstOuterTupleSlot))
 					{
 						node->hj_OuterNotEmpty = false;
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 					}
 					else
 						node->hj_OuterNotEmpty = true;
@@ -183,7 +187,10 @@ ExecHashJoin(HashJoinState *node)
 				 * outer relation.
 				 */
 				if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
-					return NULL;
+				{
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
+				}
 
 				/*
 				 * need to remember whether nbatch has increased since we
@@ -323,7 +330,8 @@ ExecHashJoin(HashJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -362,7 +370,8 @@ ExecHashJoin(HashJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -401,7 +410,8 @@ ExecHashJoin(HashJoinState *node)
 					{
 						node->js.ps.ps_TupFromTlist =
 							(isDone == ExprMultipleResult);
-						return result;
+						ExecReturnTuple(&node->js.ps, result);
+						return;
 					}
 				}
 				else
@@ -414,7 +424,10 @@ ExecHashJoin(HashJoinState *node)
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
 				if (!ExecHashJoinNewBatch(node))
-					return NULL;	/* end of join */
+				{
+					ExecReturnTuple(&node->js.ps, NULL); /* end of join */
+					return;
+				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..47285a1 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -249,7 +249,7 @@ IndexOnlyRecheck(IndexOnlyScanState *node, TupleTableSlot *slot)
  *		ExecIndexOnlyScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecIndexOnlyScan(IndexOnlyScanState *node)
 {
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index bf16cb1..b08e1b2 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -482,7 +482,7 @@ reorderqueue_pop(IndexScanState *node)
  *		ExecIndexScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecIndexScan(IndexScanState *node)
 {
 	/*
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index 97267c5..4e70183 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -36,7 +36,7 @@ static void pass_down_bound(LimitState *node, PlanState *child_node);
  *		filtering on the stream of tuples returned by a subplan.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecLimit(LimitState *node)
 {
 	ScanDirection direction;
@@ -72,7 +72,10 @@ ExecLimit(LimitState *node)
 			 * If backwards scan, just return NULL without changing state.
 			 */
 			if (!ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Check for empty window; if so, treat like empty subplan.
@@ -80,7 +83,8 @@ ExecLimit(LimitState *node)
 			if (node->count <= 0 && !node->noCount)
 			{
 				node->lstate = LIMIT_EMPTY;
-				return NULL;
+				ExecReturnTuple(&node->ps, NULL);
+				return;
 			}
 
 			/*
@@ -96,7 +100,8 @@ ExecLimit(LimitState *node)
 					 * any output at all.
 					 */
 					node->lstate = LIMIT_EMPTY;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 				node->subSlot = slot;
 				if (++node->position > node->offset)
@@ -115,7 +120,8 @@ ExecLimit(LimitState *node)
 			 * The subplan is known to return no tuples (or not more than
 			 * OFFSET tuples, in general).  So we return no tuples.
 			 */
-			return NULL;
+			ExecReturnTuple(&node->ps, NULL);
+			return;
 
 		case LIMIT_INWINDOW:
 			if (ScanDirectionIsForward(direction))
@@ -130,7 +136,8 @@ ExecLimit(LimitState *node)
 					node->position - node->offset >= node->count)
 				{
 					node->lstate = LIMIT_WINDOWEND;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 
 				/*
@@ -140,7 +147,8 @@ ExecLimit(LimitState *node)
 				if (TupIsNull(slot))
 				{
 					node->lstate = LIMIT_SUBPLANEOF;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 				node->subSlot = slot;
 				node->position++;
@@ -154,7 +162,8 @@ ExecLimit(LimitState *node)
 				if (node->position <= node->offset + 1)
 				{
 					node->lstate = LIMIT_WINDOWSTART;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 
 				/*
@@ -170,7 +179,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_SUBPLANEOF:
 			if (ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Backing up from subplan EOF, so re-fetch previous tuple; there
@@ -186,7 +198,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_WINDOWEND:
 			if (ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Backing up from window end: simply re-return the last tuple
@@ -199,7 +214,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_WINDOWSTART:
 			if (!ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Advancing after having backed off window start: simply
@@ -220,7 +238,7 @@ ExecLimit(LimitState *node)
 	/* Return the current tuple */
 	Assert(!TupIsNull(slot));
 
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /*
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index c4b5333..8daa203 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -35,7 +35,7 @@
  *		ExecLockRows
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecLockRows(LockRowsState *node)
 {
 	TupleTableSlot *slot;
@@ -57,7 +57,10 @@ lnext:
 	slot = ExecProcNode(outerPlan);
 
 	if (TupIsNull(slot))
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/* We don't need EvalPlanQual unless we get updated tuple version(s) */
 	epq_needed = false;
@@ -334,7 +337,7 @@ lnext:
 	}
 
 	/* Got all locks, so return the current tuple */
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 82e31c1..fd3b013 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -35,7 +35,7 @@
  *
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* result tuple from subplan */
+void
 ExecMaterial(MaterialState *node)
 {
 	EState	   *estate;
@@ -93,7 +93,11 @@ ExecMaterial(MaterialState *node)
 			 * fetch.
 			 */
 			if (!tuplestore_advance(tuplestorestate, forward))
-				return NULL;	/* the tuplestore must be empty */
+			{
+				/* the tuplestore must be empty */
+				ExecReturnTuple(&node->ss.ps, NULL);
+				return;
+			}
 		}
 		eof_tuplestore = false;
 	}
@@ -105,7 +109,10 @@ ExecMaterial(MaterialState *node)
 	if (!eof_tuplestore)
 	{
 		if (tuplestore_gettupleslot(tuplestorestate, forward, false, slot))
-			return slot;
+		{
+			ExecReturnTuple(&node->ss.ps, slot);
+			return;
+		}
 		if (forward)
 			eof_tuplestore = true;
 	}
@@ -132,7 +139,8 @@ ExecMaterial(MaterialState *node)
 		if (TupIsNull(outerslot))
 		{
 			node->eof_underlying = true;
-			return NULL;
+			ExecReturnTuple(&node->ss.ps, NULL);
+			return;
 		}
 
 		/*
@@ -146,13 +154,14 @@ ExecMaterial(MaterialState *node)
 		/*
 		 * We can just return the subplan's returned tuple, without copying.
 		 */
-		return outerslot;
+		ExecReturnTuple(&node->ss.ps, outerslot);
+		return;
 	}
 
 	/*
 	 * Nothing left ...
 	 */
-	return ExecClearTuple(slot);
+	ExecReturnTuple(&node->ss.ps, ExecClearTuple(slot));
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index ae0e8dc..3ef8120 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -164,7 +164,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
  *		Handles iteration over multiple subplans.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecMergeAppend(MergeAppendState *node)
 {
 	TupleTableSlot *result;
@@ -214,7 +214,7 @@ ExecMergeAppend(MergeAppendState *node)
 		result = node->ms_slots[i];
 	}
 
-	return result;
+	ExecReturnTuple(&node->ps, result);
 }
 
 /*
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index cd8d6c6..d73d9f4 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -615,7 +615,7 @@ ExecMergeTupleDump(MergeJoinState *mergestate)
  *		ExecMergeJoin
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecMergeJoin(MergeJoinState *node)
 {
 	List	   *joinqual;
@@ -653,7 +653,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -710,7 +713,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 							result = MJFillOuter(node);
 							if (result)
-								return result;
+							{
+								ExecReturnTuple(&node->js.ps, result);
+								return;
+							}
 						}
 						break;
 					case MJEVAL_ENDOFJOIN:
@@ -728,7 +734,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -765,7 +772,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 							result = MJFillInner(node);
 							if (result)
-								return result;
+							{
+								ExecReturnTuple(&node->js.ps, result);
+								return;
+							}
 						}
 						break;
 					case MJEVAL_ENDOFJOIN:
@@ -785,7 +795,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -868,7 +879,8 @@ ExecMergeJoin(MergeJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -901,7 +913,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1003,7 +1018,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1039,7 +1057,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1174,7 +1193,8 @@ ExecMergeJoin(MergeJoinState *node)
 								break;
 							}
 							/* Otherwise we're done. */
-							return NULL;
+							ExecReturnTuple(&node->js.ps, NULL);
+							return;
 					}
 				}
 				break;
@@ -1256,7 +1276,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1292,7 +1315,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1318,7 +1342,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/* Mark before advancing, if wanted */
@@ -1362,7 +1389,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1388,7 +1416,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/* Mark before advancing, if wanted */
@@ -1406,7 +1437,8 @@ ExecMergeJoin(MergeJoinState *node)
 				if (TupIsNull(innerTupleSlot))
 				{
 					MJ_printf("ExecMergeJoin: end of inner subplan\n");
-					return NULL;
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
 				}
 
 				/* Else remain in ENDOUTER state and process next tuple. */
@@ -1434,7 +1466,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1448,7 +1483,8 @@ ExecMergeJoin(MergeJoinState *node)
 				if (TupIsNull(outerTupleSlot))
 				{
 					MJ_printf("ExecMergeJoin: end of outer subplan\n");
-					return NULL;
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
 				}
 
 				/* Else remain in ENDINNER state and process next tuple. */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7bb318a..90107bd 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1298,7 +1298,7 @@ fireASTriggers(ModifyTableState *node)
  *		if needed.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecModifyTable(ModifyTableState *node)
 {
 	EState	   *estate = node->ps.state;
@@ -1333,7 +1333,10 @@ ExecModifyTable(ModifyTableState *node)
 	 * extra times.
 	 */
 	if (node->mt_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/*
 	 * On first call, fire BEFORE STATEMENT triggers before proceeding.
@@ -1411,7 +1414,8 @@ ExecModifyTable(ModifyTableState *node)
 			slot = ExecProcessReturning(resultRelInfo, NULL, planSlot);
 
 			estate->es_result_relation_info = saved_resultRelInfo;
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 
 		EvalPlanQualSetSlot(&node->mt_epqstate, planSlot);
@@ -1517,7 +1521,8 @@ ExecModifyTable(ModifyTableState *node)
 		if (slot)
 		{
 			estate->es_result_relation_info = saved_resultRelInfo;
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 	}
 
@@ -1531,7 +1536,7 @@ ExecModifyTable(ModifyTableState *node)
 
 	node->mt_done = true;
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 1895b60..54eff56 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -56,7 +56,7 @@
  *			   are prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecNestLoop(NestLoopState *node)
 {
 	NestLoop   *nl;
@@ -93,7 +93,10 @@ ExecNestLoop(NestLoopState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -128,7 +131,8 @@ ExecNestLoop(NestLoopState *node)
 			if (TupIsNull(outerTupleSlot))
 			{
 				ENL1_printf("no outer tuple, ending join");
-				return NULL;
+				ExecReturnTuple(&node->js.ps, NULL);
+				return;
 			}
 
 			ENL1_printf("saving new outer tuple information");
@@ -212,7 +216,8 @@ ExecNestLoop(NestLoopState *node)
 					{
 						node->js.ps.ps_TupFromTlist =
 							(isDone == ExprMultipleResult);
-						return result;
+						ExecReturnTuple(&node->js.ps, result);
+						return;
 					}
 				}
 				else
@@ -270,7 +275,8 @@ ExecNestLoop(NestLoopState *node)
 				{
 					node->js.ps.ps_TupFromTlist =
 						(isDone == ExprMultipleResult);
-					return result;
+					ExecReturnTuple(&node->js.ps, result);
+					return;
 				}
 			}
 			else
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 2328ef3..6e78eb2 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -72,7 +72,7 @@ build_hash_table(RecursiveUnionState *rustate)
  * 2.6 go back to 2.2
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecRecursiveUnion(RecursiveUnionState *node)
 {
 	PlanState  *outerPlan = outerPlanState(node);
@@ -102,7 +102,8 @@ ExecRecursiveUnion(RecursiveUnionState *node)
 			/* Each non-duplicate tuple goes to the working table ... */
 			tuplestore_puttupleslot(node->working_table, slot);
 			/* ... and to the caller */
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 		node->recursing = true;
 	}
@@ -151,10 +152,11 @@ ExecRecursiveUnion(RecursiveUnionState *node)
 		node->intermediate_empty = false;
 		tuplestore_puttupleslot(node->intermediate_table, slot);
 		/* ... and return it */
-		return slot;
+		ExecReturnTuple(&node->ps, slot);
+		return;
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 0d2de14..a830ffd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -63,7 +63,7 @@
  *		'nil' if the constant qualification is not satisfied.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecResult(ResultState *node)
 {
 	TupleTableSlot *outerTupleSlot;
@@ -87,7 +87,8 @@ ExecResult(ResultState *node)
 		if (!qualResult)
 		{
 			node->rs_done = true;
-			return NULL;
+			ExecReturnTuple(&node->ps, NULL);
+			return;
 		}
 	}
 
@@ -100,7 +101,10 @@ ExecResult(ResultState *node)
 	{
 		resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -130,7 +134,10 @@ ExecResult(ResultState *node)
 			outerTupleSlot = ExecProcNode(outerPlan);
 
 			if (TupIsNull(outerTupleSlot))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * prepare to compute projection expressions, which will expect to
@@ -157,11 +164,12 @@ ExecResult(ResultState *node)
 		if (isDone != ExprEndResult)
 		{
 			node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-			return resultSlot;
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
 		}
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 9ce7c02..89cce0e 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -95,7 +95,7 @@ SampleRecheck(SampleScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSampleScan(SampleScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index f12921d..1c12e27 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -121,7 +121,7 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSeqScan(SeqScanState *node)
 {
-	return ExecScan((ScanState *) node,
+	ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 7a3b67c..b7a593f 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -191,7 +191,7 @@ set_output_count(SetOpState *setopstate, SetOpStatePerGroup pergroup)
  *		ExecSetOp
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecSetOp(SetOpState *node)
 {
 	SetOp	   *plannode = (SetOp *) node->ps.plan;
@@ -204,22 +204,26 @@ ExecSetOp(SetOpState *node)
 	if (node->numOutput > 0)
 	{
 		node->numOutput--;
-		return resultTupleSlot;
+		ExecReturnTuple(&node->ps, resultTupleSlot);
+		return;
 	}
 
 	/* Otherwise, we're done if we are out of groups */
 	if (node->setop_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/* Fetch the next tuple group according to the correct strategy */
 	if (plannode->strategy == SETOP_HASHED)
 	{
 		if (!node->table_filled)
 			setop_fill_hash_table(node);
-		return setop_retrieve_hash_table(node);
+		ExecReturnTuple(&node->ps, setop_retrieve_hash_table(node));
 	}
 	else
-		return setop_retrieve_direct(node);
+		ExecReturnTuple(&node->ps, setop_retrieve_direct(node));
 }
 
 /*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 0286a7f..13f721a 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -35,7 +35,7 @@
  *		  -- the outer child is prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSort(SortState *node)
 {
 	EState	   *estate;
@@ -138,7 +138,7 @@ ExecSort(SortState *node)
 	(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
 								  slot, NULL);
-	return slot;
+	ExecReturnTuple(&node->ss.ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 75a28fd..5fae5c5 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -87,7 +87,7 @@ SubqueryRecheck(SubqueryScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSubqueryScan(SubqueryScanState *node)
 {
-	return ExecScan(&node->ss,
+	ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 2604103..e2a0479 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -387,7 +387,7 @@ TidRecheck(TidScanState *node, TupleTableSlot *slot)
  *		  -- tidPtr is -1.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecTidScan(TidScanState *node)
 {
-	return ExecScan(&node->ss,
+	ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 5d13a89..2daa001 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -42,7 +42,7 @@
  *		ExecUnique
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecUnique(UniqueState *node)
 {
 	Unique	   *plannode = (Unique *) node->ps.plan;
@@ -70,8 +70,8 @@ ExecUnique(UniqueState *node)
 		if (TupIsNull(slot))
 		{
 			/* end of subplan, so we're done */
-			ExecClearTuple(resultTupleSlot);
-			return NULL;
+			ExecReturnTuple(&node->ps, ExecClearTuple(resultTupleSlot));
+			return;
 		}
 
 		/*
@@ -98,7 +98,7 @@ ExecUnique(UniqueState *node)
 	 * won't guarantee that this source tuple is still accessible after
 	 * fetching the next source tuple.
 	 */
-	return ExecCopySlot(resultTupleSlot, slot);
+	ExecReturnTuple(&node->ps, ExecCopySlot(resultTupleSlot, slot));
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeValuesscan.c b/src/backend/executor/nodeValuesscan.c
index 2c4bd9c..6a7dadf 100644
--- a/src/backend/executor/nodeValuesscan.c
+++ b/src/backend/executor/nodeValuesscan.c
@@ -172,7 +172,7 @@ ValuesRecheck(ValuesScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecValuesScan(ValuesScanState *node)
 {
-	return ExecScan(&node->ss,
+	ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 3dc6757..67f9574 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1555,7 +1555,7 @@ update_frametailpos(WindowObject winobj, TupleTableSlot *slot)
  *	(ignoring the case of SRFs in the targetlist, that is).
  * -----------------
  */
-TupleTableSlot *
+void
 ExecWindowAgg(WindowAggState *winstate)
 {
 	TupleTableSlot *result;
@@ -1565,7 +1565,10 @@ ExecWindowAgg(WindowAggState *winstate)
 	int			numfuncs;
 
 	if (winstate->all_done)
-		return NULL;
+	{
+		ExecReturnTuple(&winstate->ss.ps, NULL);
+		return;
+	}
 
 	/*
 	 * Check to see if we're still projecting out tuples from a previous
@@ -1579,7 +1582,10 @@ ExecWindowAgg(WindowAggState *winstate)
 
 		result = ExecProject(winstate->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&winstate->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		winstate->ss.ps.ps_TupFromTlist = false;
 	}
@@ -1687,7 +1693,8 @@ restart:
 		else
 		{
 			winstate->all_done = true;
-			return NULL;
+			ExecReturnTuple(&winstate->ss.ps, NULL);
+			return;
 		}
 	}
 
@@ -1753,7 +1760,7 @@ restart:
 
 	winstate->ss.ps.ps_TupFromTlist =
 		(isDone == ExprMultipleResult);
-	return result;
+	ExecReturnTuple(&winstate->ss.ps, result);
 }
 
 /* -----------------
diff --git a/src/backend/executor/nodeWorktablescan.c b/src/backend/executor/nodeWorktablescan.c
index cfed6e6..c3615b2 100644
--- a/src/backend/executor/nodeWorktablescan.c
+++ b/src/backend/executor/nodeWorktablescan.c
@@ -77,7 +77,7 @@ WorkTableScanRecheck(WorkTableScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecWorkTableScan(WorkTableScanState *node)
 {
 	/*
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index f1be8fa..087735a 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -228,6 +228,15 @@ extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
+/* Convenience function to set a node's result to a TupleTableSlot. */
+static inline void
+ExecReturnTuple(PlanState *node, TupleTableSlot *slot)
+{
+	Assert(!node->result_ready);
+	node->result = (Node *) slot;
+	node->result_ready = true;
+}
+
 /*
  * prototypes from functions in execQual.c
  */
@@ -256,7 +265,7 @@ extern TupleTableSlot *ExecProject(ProjectionInfo *projInfo,
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
 
-extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
+extern void ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 		 ExecScanRecheckMtd recheckMtd);
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, Index varno);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 54c75e8..b86ec6a 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAgg(AggState *node);
+extern void ExecAgg(AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..70a6b62 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern AppendState *ExecInitAppend(Append *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAppend(AppendState *node);
+extern void ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index 0ed9c78..069dbc7 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern BitmapHeapScanState *ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecBitmapHeapScan(BitmapHeapScanState *node);
+extern void ExecBitmapHeapScan(BitmapHeapScanState *node);
 extern void ExecEndBitmapHeapScan(BitmapHeapScanState *node);
 extern void ExecReScanBitmapHeapScan(BitmapHeapScanState *node);
 
diff --git a/src/include/executor/nodeCtescan.h b/src/include/executor/nodeCtescan.h
index ef5c2bc..8411fa1 100644
--- a/src/include/executor/nodeCtescan.h
+++ b/src/include/executor/nodeCtescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern CteScanState *ExecInitCteScan(CteScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecCteScan(CteScanState *node);
+extern void ExecCteScan(CteScanState *node);
 extern void ExecEndCteScan(CteScanState *node);
 extern void ExecReScanCteScan(CteScanState *node);
 
diff --git a/src/include/executor/nodeCustom.h b/src/include/executor/nodeCustom.h
index 9d0b393..f6de3ab 100644
--- a/src/include/executor/nodeCustom.h
+++ b/src/include/executor/nodeCustom.h
@@ -21,7 +21,7 @@
  */
 extern CustomScanState *ExecInitCustomScan(CustomScan *custom_scan,
 				   EState *estate, int eflags);
-extern TupleTableSlot *ExecCustomScan(CustomScanState *node);
+extern void ExecCustomScan(CustomScanState *node);
 extern void ExecEndCustomScan(CustomScanState *node);
 
 extern void ExecReScanCustomScan(CustomScanState *node);
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index c255329..c34a3d6 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -18,7 +18,7 @@
 #include "nodes/execnodes.h"
 
 extern ForeignScanState *ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecForeignScan(ForeignScanState *node);
+extern void ExecForeignScan(ForeignScanState *node);
 extern void ExecEndForeignScan(ForeignScanState *node);
 extern void ExecReScanForeignScan(ForeignScanState *node);
 
diff --git a/src/include/executor/nodeFunctionscan.h b/src/include/executor/nodeFunctionscan.h
index d6e7a61..15beb13 100644
--- a/src/include/executor/nodeFunctionscan.h
+++ b/src/include/executor/nodeFunctionscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern FunctionScanState *ExecInitFunctionScan(FunctionScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecFunctionScan(FunctionScanState *node);
+extern void ExecFunctionScan(FunctionScanState *node);
 extern void ExecEndFunctionScan(FunctionScanState *node);
 extern void ExecReScanFunctionScan(FunctionScanState *node);
 
diff --git a/src/include/executor/nodeGather.h b/src/include/executor/nodeGather.h
index f76d9be..100a827 100644
--- a/src/include/executor/nodeGather.h
+++ b/src/include/executor/nodeGather.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern GatherState *ExecInitGather(Gather *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecGather(GatherState *node);
+extern void ExecGather(GatherState *node);
 extern void ExecEndGather(GatherState *node);
 extern void ExecShutdownGather(GatherState *node);
 extern void ExecReScanGather(GatherState *node);
diff --git a/src/include/executor/nodeGroup.h b/src/include/executor/nodeGroup.h
index 92639f5..446ded5 100644
--- a/src/include/executor/nodeGroup.h
+++ b/src/include/executor/nodeGroup.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern GroupState *ExecInitGroup(Group *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecGroup(GroupState *node);
+extern void ExecGroup(GroupState *node);
 extern void ExecEndGroup(GroupState *node);
 extern void ExecReScanGroup(GroupState *node);
 
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 8cf6d15..b395fd9 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern HashState *ExecInitHash(Hash *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHash(HashState *node);
+extern void ExecHash(HashState *node);
 extern Node *MultiExecHash(HashState *node);
 extern void ExecEndHash(HashState *node);
 extern void ExecReScanHash(HashState *node);
diff --git a/src/include/executor/nodeHashjoin.h b/src/include/executor/nodeHashjoin.h
index f24127a..072c610 100644
--- a/src/include/executor/nodeHashjoin.h
+++ b/src/include/executor/nodeHashjoin.h
@@ -18,7 +18,7 @@
 #include "storage/buffile.h"
 
 extern HashJoinState *ExecInitHashJoin(HashJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHashJoin(HashJoinState *node);
+extern void ExecHashJoin(HashJoinState *node);
 extern void ExecEndHashJoin(HashJoinState *node);
 extern void ExecReScanHashJoin(HashJoinState *node);
 
diff --git a/src/include/executor/nodeIndexonlyscan.h b/src/include/executor/nodeIndexonlyscan.h
index d63d194..0fbcf80 100644
--- a/src/include/executor/nodeIndexonlyscan.h
+++ b/src/include/executor/nodeIndexonlyscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern IndexOnlyScanState *ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecIndexOnlyScan(IndexOnlyScanState *node);
+extern void ExecIndexOnlyScan(IndexOnlyScanState *node);
 extern void ExecEndIndexOnlyScan(IndexOnlyScanState *node);
 extern void ExecIndexOnlyMarkPos(IndexOnlyScanState *node);
 extern void ExecIndexOnlyRestrPos(IndexOnlyScanState *node);
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..341dab3 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecIndexScan(IndexScanState *node);
+extern void ExecIndexScan(IndexScanState *node);
 extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
diff --git a/src/include/executor/nodeLimit.h b/src/include/executor/nodeLimit.h
index 96166b4..03dde30 100644
--- a/src/include/executor/nodeLimit.h
+++ b/src/include/executor/nodeLimit.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern LimitState *ExecInitLimit(Limit *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLimit(LimitState *node);
+extern void ExecLimit(LimitState *node);
 extern void ExecEndLimit(LimitState *node);
 extern void ExecReScanLimit(LimitState *node);
 
diff --git a/src/include/executor/nodeLockRows.h b/src/include/executor/nodeLockRows.h
index e828e9c..eda3cbec 100644
--- a/src/include/executor/nodeLockRows.h
+++ b/src/include/executor/nodeLockRows.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern LockRowsState *ExecInitLockRows(LockRows *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLockRows(LockRowsState *node);
+extern void ExecLockRows(LockRowsState *node);
 extern void ExecEndLockRows(LockRowsState *node);
 extern void ExecReScanLockRows(LockRowsState *node);
 
diff --git a/src/include/executor/nodeMaterial.h b/src/include/executor/nodeMaterial.h
index 2b8cae1..20bc7f6 100644
--- a/src/include/executor/nodeMaterial.h
+++ b/src/include/executor/nodeMaterial.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MaterialState *ExecInitMaterial(Material *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMaterial(MaterialState *node);
+extern void ExecMaterial(MaterialState *node);
 extern void ExecEndMaterial(MaterialState *node);
 extern void ExecMaterialMarkPos(MaterialState *node);
 extern void ExecMaterialRestrPos(MaterialState *node);
diff --git a/src/include/executor/nodeMergeAppend.h b/src/include/executor/nodeMergeAppend.h
index 0efc489..e43b5e6 100644
--- a/src/include/executor/nodeMergeAppend.h
+++ b/src/include/executor/nodeMergeAppend.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MergeAppendState *ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMergeAppend(MergeAppendState *node);
+extern void ExecMergeAppend(MergeAppendState *node);
 extern void ExecEndMergeAppend(MergeAppendState *node);
 extern void ExecReScanMergeAppend(MergeAppendState *node);
 
diff --git a/src/include/executor/nodeMergejoin.h b/src/include/executor/nodeMergejoin.h
index 74d691c..dfdbc1b 100644
--- a/src/include/executor/nodeMergejoin.h
+++ b/src/include/executor/nodeMergejoin.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MergeJoinState *ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMergeJoin(MergeJoinState *node);
+extern void ExecMergeJoin(MergeJoinState *node);
 extern void ExecEndMergeJoin(MergeJoinState *node);
 extern void ExecReScanMergeJoin(MergeJoinState *node);
 
diff --git a/src/include/executor/nodeModifyTable.h b/src/include/executor/nodeModifyTable.h
index 6b66353..fe67248 100644
--- a/src/include/executor/nodeModifyTable.h
+++ b/src/include/executor/nodeModifyTable.h
@@ -16,7 +16,7 @@
 #include "nodes/execnodes.h"
 
 extern ModifyTableState *ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecModifyTable(ModifyTableState *node);
+extern void ExecModifyTable(ModifyTableState *node);
 extern void ExecEndModifyTable(ModifyTableState *node);
 extern void ExecReScanModifyTable(ModifyTableState *node);
 
diff --git a/src/include/executor/nodeNestloop.h b/src/include/executor/nodeNestloop.h
index eeb42d6..cab1885 100644
--- a/src/include/executor/nodeNestloop.h
+++ b/src/include/executor/nodeNestloop.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern NestLoopState *ExecInitNestLoop(NestLoop *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecNestLoop(NestLoopState *node);
+extern void ExecNestLoop(NestLoopState *node);
 extern void ExecEndNestLoop(NestLoopState *node);
 extern void ExecReScanNestLoop(NestLoopState *node);
 
diff --git a/src/include/executor/nodeRecursiveunion.h b/src/include/executor/nodeRecursiveunion.h
index 1c08790..fb11eca 100644
--- a/src/include/executor/nodeRecursiveunion.h
+++ b/src/include/executor/nodeRecursiveunion.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern RecursiveUnionState *ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecRecursiveUnion(RecursiveUnionState *node);
+extern void ExecRecursiveUnion(RecursiveUnionState *node);
 extern void ExecEndRecursiveUnion(RecursiveUnionState *node);
 extern void ExecReScanRecursiveUnion(RecursiveUnionState *node);
 
diff --git a/src/include/executor/nodeResult.h b/src/include/executor/nodeResult.h
index 356027f..951fae6 100644
--- a/src/include/executor/nodeResult.h
+++ b/src/include/executor/nodeResult.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern ResultState *ExecInitResult(Result *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecResult(ResultState *node);
+extern void ExecResult(ResultState *node);
 extern void ExecEndResult(ResultState *node);
 extern void ExecResultMarkPos(ResultState *node);
 extern void ExecResultRestrPos(ResultState *node);
diff --git a/src/include/executor/nodeSamplescan.h b/src/include/executor/nodeSamplescan.h
index c8f03d8..4ab6e5a 100644
--- a/src/include/executor/nodeSamplescan.h
+++ b/src/include/executor/nodeSamplescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SampleScanState *ExecInitSampleScan(SampleScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSampleScan(SampleScanState *node);
+extern void ExecSampleScan(SampleScanState *node);
 extern void ExecEndSampleScan(SampleScanState *node);
 extern void ExecReScanSampleScan(SampleScanState *node);
 
diff --git a/src/include/executor/nodeSeqscan.h b/src/include/executor/nodeSeqscan.h
index f2e61ff..816d1a5 100644
--- a/src/include/executor/nodeSeqscan.h
+++ b/src/include/executor/nodeSeqscan.h
@@ -18,7 +18,7 @@
 #include "nodes/execnodes.h"
 
 extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSeqScan(SeqScanState *node);
+extern void ExecSeqScan(SeqScanState *node);
 extern void ExecEndSeqScan(SeqScanState *node);
 extern void ExecReScanSeqScan(SeqScanState *node);
 
diff --git a/src/include/executor/nodeSetOp.h b/src/include/executor/nodeSetOp.h
index c6e9603..dd88afb 100644
--- a/src/include/executor/nodeSetOp.h
+++ b/src/include/executor/nodeSetOp.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SetOpState *ExecInitSetOp(SetOp *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSetOp(SetOpState *node);
+extern void ExecSetOp(SetOpState *node);
 extern void ExecEndSetOp(SetOpState *node);
 extern void ExecReScanSetOp(SetOpState *node);
 
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index 481065f..f65037d 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSort(SortState *node);
+extern void ExecSort(SortState *node);
 extern void ExecEndSort(SortState *node);
 extern void ExecSortMarkPos(SortState *node);
 extern void ExecSortRestrPos(SortState *node);
diff --git a/src/include/executor/nodeSubqueryscan.h b/src/include/executor/nodeSubqueryscan.h
index 427699b..a3962c7 100644
--- a/src/include/executor/nodeSubqueryscan.h
+++ b/src/include/executor/nodeSubqueryscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SubqueryScanState *ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSubqueryScan(SubqueryScanState *node);
+extern void ExecSubqueryScan(SubqueryScanState *node);
 extern void ExecEndSubqueryScan(SubqueryScanState *node);
 extern void ExecReScanSubqueryScan(SubqueryScanState *node);
 
diff --git a/src/include/executor/nodeTidscan.h b/src/include/executor/nodeTidscan.h
index 76c2a9f..5b7bbfd 100644
--- a/src/include/executor/nodeTidscan.h
+++ b/src/include/executor/nodeTidscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern TidScanState *ExecInitTidScan(TidScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecTidScan(TidScanState *node);
+extern void ExecTidScan(TidScanState *node);
 extern void ExecEndTidScan(TidScanState *node);
 extern void ExecReScanTidScan(TidScanState *node);
 
diff --git a/src/include/executor/nodeUnique.h b/src/include/executor/nodeUnique.h
index aa8491d..b53a553 100644
--- a/src/include/executor/nodeUnique.h
+++ b/src/include/executor/nodeUnique.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern UniqueState *ExecInitUnique(Unique *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecUnique(UniqueState *node);
+extern void ExecUnique(UniqueState *node);
 extern void ExecEndUnique(UniqueState *node);
 extern void ExecReScanUnique(UniqueState *node);
 
diff --git a/src/include/executor/nodeValuesscan.h b/src/include/executor/nodeValuesscan.h
index 026f261..90288fc 100644
--- a/src/include/executor/nodeValuesscan.h
+++ b/src/include/executor/nodeValuesscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern ValuesScanState *ExecInitValuesScan(ValuesScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecValuesScan(ValuesScanState *node);
+extern void ExecValuesScan(ValuesScanState *node);
 extern void ExecEndValuesScan(ValuesScanState *node);
 extern void ExecReScanValuesScan(ValuesScanState *node);
 
diff --git a/src/include/executor/nodeWindowAgg.h b/src/include/executor/nodeWindowAgg.h
index 94ed037..f5e2c98 100644
--- a/src/include/executor/nodeWindowAgg.h
+++ b/src/include/executor/nodeWindowAgg.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern WindowAggState *ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecWindowAgg(WindowAggState *node);
+extern void ExecWindowAgg(WindowAggState *node);
 extern void ExecEndWindowAgg(WindowAggState *node);
 extern void ExecReScanWindowAgg(WindowAggState *node);
 
diff --git a/src/include/executor/nodeWorktablescan.h b/src/include/executor/nodeWorktablescan.h
index 217208a..7b1eecb 100644
--- a/src/include/executor/nodeWorktablescan.h
+++ b/src/include/executor/nodeWorktablescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern WorkTableScanState *ExecInitWorkTableScan(WorkTableScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecWorkTableScan(WorkTableScanState *node);
+extern void ExecWorkTableScan(WorkTableScanState *node);
 extern void ExecEndWorkTableScan(WorkTableScanState *node);
 extern void ExecReScanWorkTableScan(WorkTableScanState *node);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7d33b6d..a0bc8af 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1031,6 +1031,8 @@ typedef struct PlanState
 								 * top-level plan */
 
 	struct PlanState *parent;	/* node which will receive tuples from us */
+	bool		result_ready;	/* true if result is ready */
+	Node	   *result;			/* result, most often TupleTableSlot */
 
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
-- 
2.5.4 (Apple Git-61)

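To make the revised calling convention above concrete: a node-type
function no longer hands a slot back to its caller; it publishes the
result on its PlanState through ExecReturnTuple, and ExecProcNode picks
it up from there.  A minimal sketch of the pattern (the node type and
fetch function are invented purely for illustration):

/*
 * Sketch only: the shape of a node-type function under the new
 * protocol.  The "Fictional" node type and fictional_next_tuple()
 * do not exist; every real node in the patch follows this pattern.
 */
void
ExecFictional(FictionalState *node)
{
	TupleTableSlot *slot;

	slot = fictional_next_tuple(node);	/* NULL when exhausted */

	/* formerly "return slot;" */
	ExecReturnTuple(&node->ps, slot);
}
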
0003-Lightweight-framework-for-waiting-for-events.patch (text/x-diff; charset=US-ASCII)
From 4209ad4e9d3c46d143de07549061f55f23c50e9d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 9 May 2016 11:48:11 -0400
Subject: [PATCH 3/3] Lightweight framework for waiting for events.

---
 src/backend/executor/Makefile       |   4 +-
 src/backend/executor/execAsync.c    | 256 ++++++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c |  82 ++++++++----
 src/include/executor/execAsync.h    |  23 ++++
 src/include/executor/executor.h     |   2 +
 src/include/nodes/execnodes.h       |  10 ++
 6 files changed, 352 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..20601fa
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,256 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * This file contains routines that are intended to support asynchronous
+ * execution; that is, suspending an executor node until some external
+ * event occurs, or until one of its child nodes produces a tuple.
+ * This allows the executor to avoid blocking on a single external event,
+ * such as a file descriptor waiting on I/O, or a parallel worker which
+ * must complete work elsewhere in the plan tree, when there might at the
+ * same time be useful computation that could be accomplished in some
+ * other part of the plan tree.
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/executor.h"
+#include "storage/latch.h"
+
+#define	EVENT_BUFFER_SIZE		16
+
+static void ExecAsyncConfigureWait(PlanState *planstate, bool reinit);
+
+void
+ExecAsyncWaitForNode(PlanState *planstate)
+{
+	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
+	PlanState  *callbacks[EVENT_BUFFER_SIZE];
+	int			ncallbacks = 0;
+	EState *estate = planstate->state;
+
+	while (!planstate->result_ready)
+	{
+		bool	reinit = (estate->es_wait_event_set == NULL);
+		int		n;
+		int		noccurred;
+
+		if (reinit)
+		{
+			/*
+			 * Allow for a few extra events without reinitializing.  It
+			 * doesn't seem worth the complexity of doing anything very
+			 * aggressive here, because plans that depend on massive numbers
+			 * of external FDs are likely to run afoul of kernel limits anyway.
+			 */
+			estate->es_max_async_events = estate->es_total_async_events + 16;
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_max_async_events);
+		}
+
+		/* Give each waiting node a chance to add or modify events. */
+		for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+			ExecAsyncConfigureWait(estate->es_waiting_nodes[n], reinit);
+
+		/* Wait for at least one event to occur. */
+		noccurred = WaitEventSetWait(estate->es_wait_event_set, -1,
+									 occurred_event, EVENT_BUFFER_SIZE);
+		Assert(noccurred > 0);
+
+		/*
+		 * Loop over the occurred events and make a list of nodes that need
+		 * a callback.  The waiting nodes should have registered their wait
+		 * events with user_data pointing back to the node.
+		 */
+		for (n = 0; n < noccurred; ++n)
+		{
+			WaitEvent  *w = &occurred_event[n];
+			PlanState  *ps = w->user_data;
+
+			callbacks[ncallbacks++] = ps;
+		}
+
+		/*
+		 * Initially, this loop will call the node-type-specific function for
+		 * each node for which an event occurred.  If any of those nodes
+		 * produce a result, its parent enters the set of nodes that are
+		 * pending for a callback.  In this way, when a result becomes
+		 * available in a leaf of the plan tree, it can bubble upwards towards
+		 * the root as far as necessary.
+		 */
+		while (ncallbacks > 0)
+		{
+			int		i,
+					j;
+
+			/* Loop over all callbacks. */
+			for (i = 0; i < ncallbacks; ++i)
+			{
+				/* Skip if NULL. */
+				if (callbacks[i] == NULL)
+					continue;
+
+				/*
+				 * Remove any duplicates.  O(n) may not seem good, but it
+				 * should hopefully be OK as long as EVENT_BUFFER_SIZE is
+				 * not too large.
+				 */
+				for (j = i + 1; j < ncallbacks; ++j)
+					if (callbacks[i] == callbacks[j])
+						callbacks[j] = NULL;
+
+				/* Dispatch to node-type-specific code. */
+				ExecDispatchNode(callbacks[i]);
+
+				/*
+				 * If there's now a tuple ready, we must dispatch to the
+				 * parent node; otherwise, there's nothing more to do.
+				 */
+				if (callbacks[i]->result_ready)
+					callbacks[i] = callbacks[i]->parent;
+				else
+					callbacks[i] = NULL;
+			}
+
+			/* Squeeze out NULLs. */
+			for (i = 0, j = 0; j < ncallbacks; ++j)
+				if (callbacks[j] != NULL)
+					callbacks[i++] = callbacks[j];
+			ncallbacks = i;
+		}
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more events that can be registered on a WaitEventSet.  nevents
+ * should be the maximum number of events that it will wish to register.
+ * reinit should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncNeedsWait(PlanState *planstate, int nevents, bool reinit)
+{
+	EState *estate = planstate->state;
+
+	Assert(nevents > 0); 	/* otherwise, use ExecAsyncDoesNotNeedWait */
+
+	/*
+	 * If this node is not already present in the array of waiting nodes,
+	 * then add it.  If that array hasn't been allocated or is full, this may
+	 * require (re)allocating it.
+	 */
+	if (planstate->n_async_events == 0)
+	{
+		if (estate->es_num_waiting_nodes >= estate->es_max_waiting_nodes)
+		{
+			int		newmax;
+
+			if (estate->es_max_waiting_nodes == 0)
+			{
+				newmax = 16;
+				estate->es_waiting_nodes =
+					MemoryContextAlloc(estate->es_query_cxt, newmax * sizeof(PlanState *));
+			}
+			else
+			{
+				newmax = estate->es_max_waiting_nodes * 2;
+				estate->es_waiting_nodes =
+					repalloc(estate->es_waiting_nodes,
+							 newmax * sizeof(PlanState *));
+			}
+			estate->es_max_waiting_nodes = newmax;
+		}
+		estate->es_waiting_nodes[estate->es_num_waiting_nodes++] = planstate;
+	}
+
+	/* Adjust per-node and per-estate totals. */
+	estate->es_total_async_events -= planstate->n_async_events;
+	planstate->n_async_events = nevents;
+	estate->es_total_async_events += planstate->n_async_events;
+
+	/*
+	 * If a WaitEventSet has already been created, we need to discard it and
+	 * start again if the user passed reinit = true, or if the total number of
+	 * required events exceeds the supported number.
+	 */
+	if (estate->es_wait_event_set != NULL && (reinit ||
+		estate->es_total_async_events > estate->es_max_async_events))
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * If an executor node no longer needs to wait, it should call this function
+ * to report that fact.
+ */
+void
+ExecAsyncDoesNotNeedWait(PlanState *planstate)
+{
+	int		n;
+	EState *estate = planstate->state;
+
+	if (planstate->n_async_events <= 0)
+		return;
+
+	/*
+	 * Remove the node from the list of waiting nodes.  (Is a linear search
+	 * going to be a problem here?  I think probably not.)
+	 */
+	for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+	{
+		if (estate->es_waiting_nodes[n] == planstate)
+		{
+			estate->es_waiting_nodes[n] =
+				estate->es_waiting_nodes[--estate->es_num_waiting_nodes];
+			break;
+		}
+	}
+
+	/* We should always find ourselves in the array. */
+	Assert(n < estate->es_num_waiting_nodes);
+
+	/* We no longer need any asynchronous events. */
+	estate->es_total_async_events -= planstate->n_async_events;
+	planstate->n_async_events = 0;
+
+	/*
+	 * The next wait will need to rebuild the WaitEventSet, because whatever
+	 * events we registered are gone now.  It's probably OK that this code
+	 * assumes we actually did register some events at one point, because we
+	 * needed to wait at some point and we don't any more.
+	 */
+	if (estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * Give per-nodetype function a chance to register wait events.
+ */
+static void
+ExecAsyncConfigureWait(PlanState *planstate, bool reinit)
+{
+	switch (nodeTag(planstate))
+	{
+		/* XXX: Add calls to per-nodetype handlers here. */
+		default:
+			elog(ERROR, "unexpected node type: %d", (int) nodeTag(planstate));
+	}
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 3f2ebff..b7ac08e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -77,6 +77,7 @@
  */
 #include "postgres.h"
 
+#include "executor/execAsync.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
 #include "executor/nodeAppend.h"
@@ -368,24 +369,14 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 
 
 /* ----------------------------------------------------------------
- *		ExecProcNode
+ *		ExecDispatchNode
  *
- *		Execute the given node to return a(nother) tuple.
+ *		Invoke the given node's dispatch function.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
-ExecProcNode(PlanState *node)
+void
+ExecDispatchNode(PlanState *node)
 {
-	TupleTableSlot *result;
-
-	CHECK_FOR_INTERRUPTS();
-
-	/* mark any previous result as having been consumed */
-	node->result_ready = false;
-
-	if (node->chgParam != NULL) /* something changed */
-		ExecReScan(node);		/* let ReScan handle this */
-
 	if (node->instrument)
 		InstrStartNode(node->instrument);
 
@@ -539,22 +530,67 @@ ExecProcNode(PlanState *node)
 
 		default:
 			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			result = NULL;
 			break;
 	}
 
-	/* We don't support asynchronous execution yet. */
-	Assert(node->result_ready);
+	if (node->instrument)
+	{
+		double	nTuples = 0.0;
 
-	/* Result should be a TupleTableSlot, unless it's NULL. */
-	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+		if (node->result_ready && node->result != NULL &&
+			IsA(node->result, TupleTableSlot))
+			nTuples = 1.0;
 
-	result = (TupleTableSlot *) node->result;
+		InstrStopNode(node->instrument, nTuples);
+	}
+}
 
-	if (node->instrument)
-		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
-	return result;
+/* ----------------------------------------------------------------
+ *		ExecExecuteNode
+ *
+ *		Request the next tuple from the given node.  Note that
+ *		if the node supports asynchrony, result_ready may not be
+ *		set on return (use ExecProcNode if you need that, or call
+ *		ExecAsyncWaitForNode).
+ * ----------------------------------------------------------------
+ */
+void
+ExecExecuteNode(PlanState *node)
+{
+	node->result_ready = false;
+	ExecDispatchNode(node);
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecProcNode
+ *
+ *		Get the next tuple from the given node.  If the node is
+ *		asynchronous, wait for a tuple to be ready before
+ *		returning.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecProcNode(PlanState *node)
+{
+	CHECK_FOR_INTERRUPTS();
+
+	/* mark any previous result as having been consumed */
+	node->result_ready = false;
+
+	if (node->chgParam != NULL) /* something changed */
+		ExecReScan(node);		/* let ReScan handle this */
+
+	ExecDispatchNode(node);
+
+	if (!node->result_ready)
+		ExecAsyncWaitForNode(node);
+
+	/* Result should be a TupleTableSlot, unless it's NULL. */
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+
+	return (TupleTableSlot *) node->result;
 }
 
 
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..38b37a1
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncWaitForNode(PlanState *planstate);
+extern void ExecAsyncNeedsWait(PlanState *planstate, int nevents,
+	bool reinit);
+extern void ExecAsyncDoesNotNeedWait(PlanState *planstate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 087735a..979dea3 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -223,6 +223,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
  */
 extern PlanState *ExecInitNode(Plan *node, EState *estate, PlanState *parent,
 			 int eflags);
+extern void ExecDispatchNode(PlanState *node);
+extern void ExecExecuteNode(PlanState *node);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a0bc8af..3dba03c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -382,6 +382,14 @@ typedef struct EState
 	ParamListInfo es_param_list_info;	/* values of external params */
 	ParamExecData *es_param_exec_vals;	/* values of internal params */
 
+	/* Asynchronous execution support */
+	struct PlanState **es_waiting_nodes;		/* array of waiting nodes */
+	int			es_num_waiting_nodes;	/* # of waiters in array */
+	int			es_max_waiting_nodes;	/* # of allocated entries */
+	int			es_total_async_events;	/* total of per-node n_async_events */
+	int			es_max_async_events;	/* # supported by event set */
+	struct WaitEventSet *es_wait_event_set;
+
 	/* Other working state: */
 	MemoryContext es_query_cxt; /* per-query context in which EState lives */
 
@@ -1034,6 +1042,8 @@ typedef struct PlanState
 	bool		result_ready;	/* true if result is ready */
 	Node	   *result;			/* result, most often TupleTableSlot */
 
+	int			n_async_events;	/* # of async events we want to register */
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
 
-- 
2.5.4 (Apple Git-61)
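
To illustrate how an asynchrony-aware caller might drive these new
entry points (a sketch only, not part of the attached patches; the
surrounding Append-like logic is imagined):

	/* hypothetical async-aware parent, e.g. inside ExecAppend */
	ExecExecuteNode(child);			/* request a tuple without blocking */
	if (!child->result_ready)
		return;						/* go run another part of the plan */

	/* the child produced something; consume it */
	slot = (TupleTableSlot *) child->result;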

#2Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#1)
Re: asynchronous and vectorized execution

On 9 May 2016 at 19:33, Robert Haas <robertmhaas@gmail.com> wrote:

I believe there are other people thinking about these
topics as well, including Andres Freund, Kyotaro Horiguchi, and
probably some folks at 2ndQuadrant (but I don't know exactly who).

1. asynchronous execution

Not looking at that.

2. vectorized execution...

We might also want to consider storing batches of tuples in a
column-optimized
rather than row-optimized format so that iterating through one or two
attributes across every tuple in the batch touches the minimal number
of cache lines.

The team is about 2 years into research and coding a prototype on those
topics at this point, with agreed time for further work over the next 2
years.

I'll let my colleagues chime in with details since I'm not involved at that
level any more.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3David Rowley
david.rowley@2ndquadrant.com
In reply to: Robert Haas (#1)
Re: asynchronous and vectorized execution

On 10 May 2016 at 05:33, Robert Haas <robertmhaas@gmail.com> wrote:

2. vectorized execution, by which I mean the ability of a node to
return tuples in batches rather than one by one. Andres has opined
more than once that repeated trips through ExecProcNode defeat the
ability of the CPU to do branch prediction correctly, slowing the
whole system down, and that they also result in poor CPU cache
behavior, since we jump all over the place executing a little bit of
code from each node before moving onto the next rather than running
one bit of code first, and then another later. I think that's
probably right. For example, consider a 5-table join where all of
the joins are implemented as hash tables. If this query plan is going
to be run to completion, it would make much more sense to fetch, say,
100 tuples from the driving scan and then probe for all of those in
the first hash table, and then probe for all of those in the second
hash table, and so on. What we do instead is fetch one tuple and
probe for it in all 5 hash tables, and then repeat. If one of those
hash tables would fit in the CPU cache but all five together will not,
that seems likely to be a lot worse. But even just ignoring the CPU
cache aspect of it for a minute, suppose you want to write a loop to
perform a hash join. The inner loop fetches the next tuple from the
probe table and does a hash lookup. Right now, fetching the next
tuple from the probe table means calling a function which in turn
calls another function which probably calls another function which
probably calls another function and now about 4 layers down we
actually get the next tuple. If the scan returned a batch of tuples
to the hash join, fetching the next tuple from the batch would
probably be 0 or 1 function calls rather than ... more. Admittedly,
you've got to consider the cost of marshaling the batches but I'm
optimistic that there are cycles to be squeezed out here. We might
also want to consider storing batches of tuples in a column-optimized
rather than row-optimized format so that iterating through one or two
attributes across every tuple in the batch touches the minimal number
of cache lines.

It's interesting that you mention this. We identified this as a pain
point during our work on column stores last year. Simply passing
single tuples around the executor is really unfriendly towards L1
instruction cache, plus also the points you mention about L3 cache and
hash tables and tuple stores. I really think that we're likely to see
significant gains by processing >1 tuple at a time, so this topic very
much interests me.

In researching this, we've found that other people's research does
indicate that there are gains to be had:
http://www.openlinksw.com/weblog/oerling/

In that blog there's a table that indicates that this row-store
database saw a 4.4x performance improvement from changing from a
tuple-at-a-time executor to a batch tuple executor.

Batch Size 1 tuple = 122 seconds
Batch Size 10k tuples = 27.7 seconds

When we start multiplying those increases with the gains from
something like parallel query, we're starting to see very nice
improvements in performance.

Alvaro, Tomas and I had been discussing this and late last year I did
look into what would be required to allow this to happen in Postgres.
Basically there are 2 sub-projects; I'll describe what I've managed to
learn so far about each, and the rough plan that I have to implement
them:

1. Batch Execution:

a. Modify ScanAPI to allow batch tuple fetching in predefined batch sizes.
b. Modify TupleTableSlot to allow > 1 tuple to be stored, with a flag
to indicate whether the struct contains a single tuple or multiple
tuples (see the sketch after this list). Multiple tuples may need to
be deformed in a non-lazy fashion in order to prevent too many buffers
from having to be pinned at once. Tuples will be deformed into arrays
for each column rather than arrays for each tuple (this part is
important to support the next sub-project).
c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
process a batch TupleTableSlot. This will require some tight loop to
aggregate the entire TupleTableSlot at once before returning.
d. Add a function in execAmi.c which returns true or false depending
on whether the node supports batch TupleTableSlots or not.
e. At executor startup, determine if the entire plan tree supports
batch TupleTableSlots, and if so enable batch scan mode.
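
To make 'b' above a bit more concrete, a batch-capable slot might look
roughly like this (the struct and field names are illustrative only,
not from any existing patch):

typedef struct TupleBatchSlot
{
	TupleTableSlot base;		/* ordinary single-tuple slot fields */
	bool		is_batch;		/* does this slot hold a batch? */
	int			ntuples;		/* number of tuples in the batch */
	Datum	  **columns;		/* one Datum array per attribute */
	bool	  **isnull;			/* matching per-attribute null flags */
} TupleBatchSlot;

Storing the batch column-wise like this is what makes the vectorised
function calls in sub-project 2 possible.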

That at least is my idea for stage 1. There's still more to work out,
e.g. should batch mode occur when the query has a LIMIT? We might not
want to waste time gathering up extra tuples when we're just going to
stop after the first one. So perhaps 'e' above should be up to the
planner instead. Further development work here might add a node type
that de-batches a TupleTableSlot to allow nodes which don't support
batching to be in the plan, i.e. "mixed execution mode". I'm less
excited about this as it may be difficult to cost that operation;
probably the time would be better spent just batch-enabling the other
node types, which *may* not be all that difficult. I'm also assuming
that batch mode (in all cases apart from queries with LIMIT or
cursors) will always be faster than tuple-at-a-time, so it requires no
costings from the planner.

2. Vector processing

(I admit that I've given this part much less thought so far, but
here's what I have in mind)

This depends on batch execution, and is intended to allow the executor
to perform function calls on an entire batch at once, rather than
tuple-at-a-time. For example, take the following query:

SELECT a+b FROM t;

here (as of now) we'd scan "t" one row at a time and perform a+b after
having deformed enough of the tuple to do that. We'd then go and get
another Tuple from the scan node and repeat until the scan gave us no
more Tuples.

With batch execution we'd fetch multiple Tuples from the scan and we'd
then perform the call to say int4_pl() multiple times, which still
kinda sucks as it means calling int4_pl() possibly millions of times
(once per tuple). The vector mode here would require that we modify
pg_operator to add a vector function for each operator, so that we can
call the function passing in an array of Datums and a length. We'd
then call something like int4_pl_vector() only once per batch of
tuples, allowing the CPU to perform SIMD operations on those datum
arrays. This could be done in an incremental way, as the code could
just fall back on the standard function in cases where a vectorised
version of it is not available. Thought is needed here about when
exactly this decision is made, as the user may not have permissions to
execute the vector function, so it can't simply be a run-time check.
These functions would simply return another vector of the results.
Aggregates could be given a vector transition function, where
something like COUNT(*)'s vector_transfn would simply do
current_count += vector_length;
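
For illustration, a vectorised addition might look roughly like this
(a sketch only; int4_pl_vector is hypothetical, and overflow checking
is omitted for brevity):

static void
int4_pl_vector(const Datum *a, const Datum *b, Datum *result, int n)
{
	int		i;

	/* a tight loop over plain arrays, which compilers auto-vectorise well */
	for (i = 0; i < n; i++)
		result[i] = Int32GetDatum(DatumGetInt32(a[i]) +
								  DatumGetInt32(b[i]));
}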

This project does appear to require that we bloat the code with 100's
of vector versions of each function. I'm not quite sure if there's a
better way to handle this. The problem is that the fmgr is pretty much
a barrier to SIMD operations, and this was the only idea that I've had
so far about breaking through that barrier. So further ideas here are
very welcome.

The idea here is that these 2 projects help pave the way to bring
columnar storage into PostgreSQL. Without these, we're unlikely to get
much benefit from columnar storage, as we'd be stuck processing rows one
at a time still. Adding columnar storage on the top of the above
should further increase performance as we can skip the tuple-deform
step and pull columnar array/vectors directly into a TupleTableSlot,
although some trickery would be involved here when the scan has keys.

I just want to add that both of the above do require more thought. We
realised that this was required quite late in our column store work
(which we've all now taken a break from to work on other things), so
we've had little time to look much further into it, although I should
be starting work again on this in the next few months in the hope of
having something, even the most simple version of it, in 9.7.

Comments are welcome

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Robert Haas (#1)
Re: asynchronous and vectorized execution

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Tuesday, May 10, 2016 2:34 AM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] asynchronous and vectorized execution

Hi,

I realize that we haven't gotten 9.6beta1 out the door yet, but I
think we can't really wait much longer to start having at least some
discussion of 9.7 topics, so I'm going to go ahead and put this one
out there. I believe there are other people thinking about these
topics as well, including Andres Freund, Kyotaro Horiguchi, and
probably some folks at 2ndQuadrant (but I don't know exactly who). To
make a long story short, I think there are several different areas
where we should consider major upgrades to our executor. It's too
slow and it doesn't do everything we want it to do. The main things
on my mind are:

1. asynchronous execution, by which I mean the ability of a node to
somehow say that it will generate a tuple eventually, but is not yet
ready, so that the executor can go run some other part of the plan
tree while it waits. This case most obviously arises for foreign
tables, where it makes little sense to block on I/O if some other part
of the query tree could benefit from the CPU; consider SELECT * FROM
lt WHERE qual UNION SELECT * FROM ft WHERE qual. It is also a problem
for parallel query: in a parallel sequential scan, the next worker can
begin reading the next block even if the current block hasn't yet been
received from the OS. Whether or not this will be efficient is a
research question, but it can be done. However, imagine a parallel
scan of a btree index: we don't know what page to scan next until we
read the previous page and examine the next-pointer. In the meantime,
any worker that arrives at that scan node has no choice but to block.
It would be better if the scan node could instead say "hey, thanks for
coming but I'm really not ready to be on-CPU just at the moment" and
potentially allow the worker to go work in some other part of the
query tree. For that worker to actually find useful work to do
elsewhere, we'll probably need it to be the case either that the table
is partitioned or the original query will need to involve UNION ALL,
but those are not silly cases to worry about, particularly if we get
native partitioning in 9.7.

Is the parallel-aware Append node sufficient to run multiple nodes
asynchronously? (Sorry, I couldn't find enough time to code the
feature even though we discussed it before.)
If some of the child nodes are blocked by I/O or other heavy stuff,
they cannot enqueue results into shm_mq, and thus the Gather node
naturally skips nodes that are not ready.
In the above example, the scan on the foreign table has a longer lead
time than the local scan. If Append can map every child node onto an
individual worker, the local scan worker begins returning tuples
first, and mixed tuples shall be returned eventually.

However, process-internal asynchronous execution may also be
beneficial in cases where the cost of shm_mq is not negligible (e.g.,
no scan qualifiers are given to the worker process). I think it allows
pre-fetching to be implemented very naturally.

2. vectorized execution, by which I mean the ability of a node to
return tuples in batches rather than one by one. Andres has opined
more than once that repeated trips through ExecProcNode defeat the
ability of the CPU to do branch prediction correctly, slowing the
whole system down, and that they also result in poor CPU cache
behavior,

My concern about ExecProcNode is that it is constructed with a large
switch ... case statement, which involves tons of comparison
operations at run-time. If we replace this switch ... case with a
function pointer, it would probably improve performance, especially
for OLAP workloads that process large amounts of rows.
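
As a minimal sketch of that idea (exec_proc here is a hypothetical new
PlanState field, set once by each node's ExecInit* routine; nothing
like this exists in the tree today):

typedef TupleTableSlot *(*ExecProcNodeFn) (PlanState *node);

TupleTableSlot *
ExecProcNode(PlanState *node)
{
	/* one indirect call replaces the switch over nodeTag(node) */
	return node->exec_proc(node);
}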

since we jump all over the place executing a little bit of
code from each node before moving onto the next rather than running
one bit of code first, and then another later. I think that's
probably right. For example, consider a 5-table join where all of
the joins are implemented as hash tables. If this query plan is going
to be run to completion, it would make much more sense to fetch, say,
100 tuples from the driving scan and then probe for all of those in
the first hash table, and then probe for all of those in the second
hash table, and so on. What we do instead is fetch one tuple and
probe for it in all 5 hash tables, and then repeat. If one of those
hash tables would fit in the CPU cache but all five together will not,
that seems likely to be a lot worse.

I can agree with the above concern from my experience. Each HashJoin
step needs to fill up a TupleTableSlot at each depth; in multi-table
joins, this is mostly just relocation of the attributes.

If HashJoin could gather its five underlying hash tables at once, it
could avoid unnecessary setup of intermediate tuples.
A positive example is GpuHashJoin in PG-Strom. It constructs a
multi-relation hash table, then produces joined tuples at once.
Its performance is generally good.

But even just ignoring the CPU
cache aspect of it for a minute, suppose you want to write a loop to
perform a hash join. The inner loop fetches the next tuple from the
probe table and does a hash lookup. Right now, fetching the next
tuple from the probe table means calling a function which in turn
calls another function which probably calls another function which
probably calls another function and now about 4 layers down we
actually get the next tuple. If the scan returned a batch of tuples
to the hash join, fetching the next tuple from the batch would
probably be 0 or 1 function calls rather than ... more. Admittedly,
you've got to consider the cost of marshaling the batches but I'm
optimistic that there are cycles to be squeezed out here. We might
also want to consider storing batches of tuples in a column-optimized
rather than row-optimized format so that iterating through one or two
attributes across every tuple in the batch touches the minimal number
of cache lines.

Obviously, both of these are big projects that could touch a large
amount of executor code, and there may be other ideas, in addition to
these, which some of you may be thinking about that could also touch a
large amount of executor code. It would be nice to agree on a way
forward that minimizes code churn and maximizes everyone's attempt to
contribute without conflicting with each other. Also, it seems
desirable to enable, as far as possible, incremental development - in
particular, it seems to me that it would be good to pick a design that
doesn't require massive changes to every node all at once. A single
patch that adds some capability to every node in the executor in one
fell swoop is going to be too large to review effectively.

My proposal for how to do this is to make ExecProcNode function as a
backward-compatibility wrapper. For asynchronous execution, a node
might return a not-ready-yet indication, but if that node is called
via ExecProcNode, it means the caller isn't prepared to receive such
an indication, so ExecProcNode will just wait for the node to become
ready and then return the tuple.

Backward compatibility is good. In addition, a child node may want to
know the context in which it is called, and may want to switch its
behavior according to the caller's expectation. For example, it may be
beneficial if SeqScan does more aggressive prefetching under
asynchronous execution.

Also, can we consider which data format will be returned from the
child node during the planning stage? It affects the cost of
inter-node data exchange. If a pair of parent and child nodes supports
a special data format (as the existing HashJoin and Hash do), that
should be a discount factor in cost estimation.

Similarly, for vectorized execution,
a node might return a bunch of tuples all at once. ExecProcNode will
extract the first one and return it to the caller, and subsequent
calls to ExecProcNode will iterate through the rest of the batch, only
calling the underlying node-specific function when the batch is
exhausted. In this way, code that doesn't know about the new stuff
can continue to work pretty much as it does today. Also, and I think
this is important, nodes don't need the permission of their parent
node to use these new capabilities. They can use them whenever they
wish, without worrying about whether the upper node is prepared to
deal with it. If not, ExecProcNode will paper over the problem. This
seems to me to be a good way to keep the code simple.

For asynchronous execution, I have gone so far as to mock up a bit of
what this might look like. This shouldn't be taken very seriously at
this point, but I'm attaching a few very-much-WIP patches to show the
direction of my line of thinking. Basically, I propose to have
ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
by putting them into a new PlanState member called "result", which is
just a Node * so that we can support multiple types of results,
instead of returning them. There is also a result_ready boolean, so
that a node can return without setting this Boolean to engage
asynchronous behavior. This triggers an "event loop", which
repeatedly waits for FDs chosen by waiting nodes to become readable
and/or writeable and then gives the node a chance to react.
Eventually, the waiting node will stop waiting and have a result
ready, at which point the event loop will give the parent of that node
a chance to run. If that node consequently becomes ready, then its
parent gets a chance to run. Eventually (we hope), the node for which
we're waiting becomes ready, and we can then read a result tuple.
With some more work, this seems like it can handle the FDW case, but I
haven't worked out how to make it handle the related parallel query
case. What we want there is to wait not for the readiness of an FD
but rather for some other process involved in the parallel query to
reach a point where it can welcome assistance executing that node. I
don't know exactly what the signaling for that should look like yet -
maybe setting the process latch or something.

By the way, one smaller executor project that I think we should also
look at has to do with this comment in nodeSeqScan.c:

static bool
SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
{
/*
* Note that unlike IndexScan, SeqScan never use keys in heap_beginscan
* (and this is very bad) - so, here we do not check are keys ok or not.
*/
return true;
}

Some quick prototyping by my colleague Dilip Kumar suggests that, in
fact, there are cases where pushing down keys into heap_beginscan()
could be significantly faster. Some care is required here because any
functions we execute as scan keys are run with the buffer locked, so
we had better not run anything very complicated. But doing this for
simple things like integer equality operators seems like it could save
quite a few buffer lock/unlock cycles and some other executor overhead
as well.

Thoughts, ideas, suggestions, etc. very welcome.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#5Greg Stark
stark@mit.edu
In reply to: David Rowley (#3)
Re: asynchronous and vectorized execution

On 9 May 2016 8:34 pm, "David Rowley" <david.rowley@2ndquadrant.com> wrote:

This project does appear to require that we bloat the code with 100's
of vector versions of each function. I'm not quite sure if there's a
better way to handle this. The problem is that the fmgr is pretty much
a barrier to SIMD operations, and this was the only idea that I've had
so far about breaking through that barrier. So further ideas here are
very welcome.

Well yes and no. In practice I think you only need to worry about
vectorised versions of integer and possibly float. For other data types
there either aren't vectorised operators or there's little using them.

And I'll make a bold claim here that the only operators I think really
matter are =

The reason is that using SIMD instructions is a minor win if you have
any further work to do per tuple. The only time it's a big win is if
you're eliminating entire tuples from consideration efficiently. = is
going to do that often, other btree operator classes might be somewhat
useful, but things like + really only would come up in odd examples.

But even that understates things. If you have column oriented storage then
= becomes even more important since every scan has a series of implied
equijoins to reconstruct the tuple. And the coup de grace is that in a
column oriented storage you try to store variable length data as integer
indexes into a dictionary of common values so *everything* is an integer =
operation.
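
To make that concrete, a toy sketch (none of these names exist
anywhere; dictionary codes are assumed to be int32):

	int32	codes[BATCH_SIZE];	/* column stored as dictionary codes */
	int32	target;				/* code of the value being sought */
	bool	match[BATCH_SIZE];
	int		i;

	for (i = 0; i < nrows; i++)
		match[i] = (codes[i] == target);	/* trivially SIMD-able */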

How to do this without punching right through the executor as an
abstraction and still supporting extensible data types and operators was
puzzling me already. I do think it involves having these vector operators
in the catalogue and also some kind of compression mapping to integer
indexes. But I'm not sure that's all that would be needed.

#6David Rowley
david.rowley@2ndquadrant.com
In reply to: Kouhei Kaigai (#4)
Re: asynchronous and vectorized execution

On 10 May 2016 at 13:38, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

My concern about ExecProcNode is that it is constructed with a large
switch ... case statement, which involves tons of comparison
operations at run-time. If we replace this switch ... case with a
function pointer, it would probably improve performance, especially
for OLAP workloads that process large amounts of rows.

I imagined that any decent compiler would have built the code to use
jump tables for this. I have to say that I've never checked to make
sure though.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: David Rowley (#6)
Re: asynchronous and vectorized execution

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of David Rowley
Sent: Tuesday, May 10, 2016 2:01 PM
To: Kaigai Kouhei(海外 浩平)
Cc: Robert Haas; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] asynchronous and vectorized execution

On 10 May 2016 at 13:38, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

My concern about ExecProcNode is that it is constructed with a large
switch ... case statement, which involves tons of comparison
operations at run-time. If we replace this switch ... case with a
function pointer, it would probably improve performance, especially
for OLAP workloads that process large amounts of rows.

I imagined that any decent compiler would have built the code to use
jump tables for this. I have to say that I've never checked to make
sure though.

Ah, indeed, you are right. Please forget the above part.

In GCC 4.8.5, the case labels between T_ResultState and T_LimitState
were handled using a jump table.

TupleTableSlot *
ExecProcNode(PlanState *node)
{
:
<snip>
:
switch (nodeTag(node))
5ad361: 8b 03 mov (%rbx),%eax
5ad363: 2d c9 00 00 00 sub $0xc9,%eax
5ad368: 83 f8 24 cmp $0x24,%eax
5ad36b: 0f 87 4f 02 00 00 ja 5ad5c0 <ExecProcNode+0x290>
5ad371: ff 24 c5 68 48 8b 00 jmpq *0x8b4868(,%rax,8)
5ad378: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
5ad37f: 00

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

#8David Rowley
david.rowley@2ndquadrant.com
In reply to: Greg Stark (#5)
Re: asynchronous and vectorized execution

On 10 May 2016 at 16:34, Greg Stark <stark@mit.edu> wrote:

On 9 May 2016 8:34 pm, "David Rowley" <david.rowley@2ndquadrant.com> wrote:

This project does appear to require that we bloat the code with 100's
of vector versions of each function. I'm not quite sure if there's a
better way to handle this. The problem is that the fmgr is pretty much
a barrier to SIMD operations, and this was the only idea that I've had
so far about breaking through that barrier. So further ideas here are
very welcome.

Well yes and no. In practice I think you only need to worry about vectorised
versions of integer and possibly float. For other data types there either
aren't vectorised operators or there's little using them.

And I'll make a bold claim here that the only operators I think really
matter are =

The reason is that using SIMD instructions is a minor win if you have any
further work to do per tuple. The only time it's a big win is if you're
eliminating entire tuples from consideration efficiently. = is going to do
that often, other btree operator classes might be somewhat useful, but
things like + really only would come up in odd examples.

But even that understates things. If you have column oriented storage then =
becomes even more important since every scan has a series of implied
equijoins to reconstruct the tuple. And the coup de grace is that in a
column oriented storage you try to store variable length data as integer
indexes into a dictionary of common values so *everything* is an integer =
operation.

How to do this without punching right through the executor as an abstraction
and still supporting extensible data types and operators was puzzling me
already. I do think it involves having these vector operators in the
catalogue and also some kind of compression mapping to integer indexes. But
I'm not sure that's all that would be needed.

Perhaps the first move to make on this front will be for aggregate
functions. Experimentation should make it quite simple to realise
which functions will bring enough benefit. I imagined that even Datums
where the type is not processor native might yield a small speedup,
not from SIMD, but just from fewer calls through fmgr. Perhaps we'll
realise that those are not worth the trouble; I've no idea at this
stage.
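
As a trivial example (sketch only, ignoring the fmgr calling
convention for clarity), COUNT(*)'s vector transition function from
upthread could be as little as:

static int64
count_vector_transfn(int64 current_count, int vector_length)
{
	/* consume a whole batch with a single transition call */
	return current_count + vector_length;
}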

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9Pavel Stehule
pavel.stehule@gmail.com
In reply to: David Rowley (#8)
Re: asynchronous and vectorized execution

2016-05-10 8:05 GMT+02:00 David Rowley <david.rowley@2ndquadrant.com>:

On 10 May 2016 at 16:34, Greg Stark <stark@mit.edu> wrote:

On 9 May 2016 8:34 pm, "David Rowley" <david.rowley@2ndquadrant.com> wrote:

This project does appear to require that we bloat the code with 100's
of vector versions of each function. I'm not quite sure if there's a
better way to handle this. The problem is that the fmgr is pretty much
a barrier to SIMD operations, and this was the only idea that I've had
so far about breaking through that barrier. So further ideas here are
very welcome.

Well yes and no. In practice I think you only need to worry about
vectorised versions of integer and possibly float. For other data
types there either aren't vectorised operators or there's little using
them.

And I'll make a bold claim here that the only operators I think really
matter are =

The reason is that using SIMD instructions is a minor win if you have
any further work to do per tuple. The only time it's a big win is if
you're eliminating entire tuples from consideration efficiently. = is
going to do that often, other btree operator classes might be somewhat
useful, but things like + really only would come up in odd examples.

But even that understates things. If you have column oriented storage
then = becomes even more important since every scan has a series of
implied equijoins to reconstruct the tuple. And the coup de grace is
that in a column oriented storage you try to store variable length
data as integer indexes into a dictionary of common values so
*everything* is an integer = operation.

How to do this without punching right through the executor as an
abstraction and still supporting extensible data types and operators
was puzzling me already. I do think it involves having these vector
operators in the catalogue and also some kind of compression mapping
to integer indexes. But I'm not sure that's all that would be needed.

Perhaps the first move to make on this front will be for aggregate
functions. Experimentation should make it quite simple to realise
which functions will bring enough benefit. I imagined that even Datums
where the type is not processor native might yield a small speedup,
not from SIMD, but just from fewer calls through fmgr. Perhaps we'll
realise that those are not worth the trouble; I've no idea at this
stage.

It can be reduced to sum and count in a first iteration. On the other
hand, a lot of OLAP reports are based on pretty complex expressions -
and there, compilation is probably the better way.

Regards

Pavel

#10konstantin knizhnik
k.knizhnik@postgrespro.ru
In reply to: Robert Haas (#1)
Re: asynchronous and vectorized execution

Hi,

1. asynchronous execution,

It seems to me that asynchronous execution can be considered an
alternative to the multithreading model (in the case of PostgreSQL,
the roles of threads are played by workers). Async operations have
smaller overhead, but they have scalability problems (because in most
implementations of cooperative multitasking there is just one
processing thread, so it cannot consume more than one core).

So I wonder whether asynchronous execution is trying to achieve the
same goal as parallel query execution, but using a slightly different
mechanism.
You wrote:

in the meantime, any worker that arrives at that scan node has no choice but to block.

What's wrong with the worker being blocked? You can just have more
workers (more than CPU cores) so that others of them can continue to
do useful work.
But I agree that

Whether or not this will be efficient is a research question

2. vectorized execution

Vector I/O is very important for a columnar store. In the IMCS
extension (in-memory columnar store), using vector operations
increases speed 10-100 times, depending on the size of the data set
and the query. Obviously the best results are for grand aggregates.

But there are some researches, for example:

http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

showing that the same or an even better effect can be achieved by
generating native code for the query execution plan (which is not so
difficult now, thanks to LLVM).
It eliminates interpretation overhead and increases cache locality.
I got similar results in my own experiments accelerating SparkSQL.
Instead of native code generation I used conversion of query plans to
C code and experimented with different data representations. A
"horizontal model" with loading of columns on demand showed better
performance than a columnar store.

As far as I know, a native code generator for PostgreSQL is currently
being developed by ISP RAN.
Sorry, slides in Russian:
https://pgconf.ru/media/2016/02/19/6%20Мельник%20Дмитрий%20Михайлович,%2005-02-2016.pdf

At this moment (February) they have implemented translation of only a
few PostgreSQL operators used by ExecQual, and they do not support
aggregates yet.
They get about a 2x speed increase on synthetic queries and a 25%
increase on TPC-H Q1 (for Q1 the most critical part is generating
native code for aggregates, because ExecQual itself takes only 6% of
the time for this query).
Actually, these 25% for Q1 were achieved not by using dynamic code
generation, but by switching from a PULL to a PUSH model in the
executor.
It seems to be yet another interesting PostgreSQL executor
transformation.
As far as I know, they are going to publish the results of their work
as open source...
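
To illustrate the PULL-to-PUSH distinction in executor terms (a sketch
only; scan_run and consume are hypothetical, not from the cited work):

/* PULL: the parent repeatedly asks the child for its next tuple */
for (;;)
{
	TupleTableSlot *slot = ExecProcNode(childstate);

	if (TupIsNull(slot))
		break;
	consume(slot);
}

/* PUSH: the child drives, invoking a consumer callback per tuple */
scan_run(scanstate, consume);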

#11Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Robert Haas (#1)
1 attachment(s)
Re: asynchronous and vectorized execution

Hello.

At Mon, 9 May 2016 13:33:55 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+Tgmobx8su_bYtAa3DgrqB+R7xZG6kHRj0ccMUUshKAQVftww@mail.gmail.com>

Hi,

I realize that we haven't gotten 9.6beta1 out the door yet, but I
think we can't really wait much longer to start having at least some
discussion of 9.7 topics, so I'm going to go ahead and put this one
out there. I believe there are other people thinking about these
topics as well, including Andres Freund, Kyotaro Horiguchi, and
probably some folks at 2ndQuadrant (but I don't know exactly who). To
make a long story short, I think there are several different areas
where we should consider major upgrades to our executor. It's too
slow and it doesn't do everything we want it to do. The main things
on my mind are:

1. asynchronous execution, by which I mean the ability of a node to
somehow say that it will generate a tuple eventually, but is not yet
ready, so that the executor can go run some other part of the plan
tree while it waits. This case most obviously arises for foreign
tables, where it makes little sense to block on I/O if some other part
of the query tree could benefit from the CPU; consider SELECT * FROM
lt WHERE qual UNION SELECT * FROM ft WHERE qual.

This is my main concern and what I wanted to solve.

It is also a problem
for parallel query: in a parallel sequential scan, the next worker can
begin reading the next block even if the current block hasn't yet been
received from the OS. Whether or not this will be efficient is a
research question, but it can be done. However, imagine a parallel
scan of a btree index: we don't know what page to scan next until we
read the previous page and examine the next-pointer. In the meantime,
any worker that arrives at that scan node has no choice but to block.
It would be better if the scan node could instead say "hey, thanks for
coming but I'm really not ready to be on-CPU just at the moment" and
potentially allow the worker to go work in some other part of the
query tree.

Especially for foreign tables, there must be gaps between sending
FETCH and getting the result. Visiting other tables is a very
effective way to fill those gaps, and using file descriptors greatly
helps here, thanks to the new WaitEventSet API. The attached is a
WIP PoC of that (sorry for including some debug code and irrelevant
code). Its Exec* APIs are a bit different from the 0002 patch, but it
works, even if only for postgres_fdw and Append. It embeds the waiting
code into ExecAppend, but that is easily replaceable with the
framework in Robert's 0003 patch.
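
The core of such a wait, using the new API, looks roughly like this
(simplified; error handling and event re-registration omitted; ps->fd
is the field this PoC adds to PlanState, and ExecDispatchNode is from
Robert's patches upthread):

	WaitEventSet *set = CreateWaitEventSet(CurrentMemoryContext, nnodes);
	WaitEvent	event;
	ListCell   *lc;

	/* register the socket of every waiting async node */
	foreach(lc, waiting_nodes)
	{
		PlanState  *ps = (PlanState *) lfirst(lc);

		AddWaitEventToSet(set, WL_SOCKET_READABLE, ps->fd, NULL, ps);
	}

	/* block until some node's socket is readable, then let it retry */
	if (WaitEventSetWait(set, -1, &event, 1) == 1)
		ExecDispatchNode((PlanState *) event.user_data);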

Apart from the core part, in postgres_fdw several scans may reside
together on one connection. These scans share the same FD, but there
is no means to identify for which scan node the FD is signalled. To
handle that situation, we might need a 'seemed to be ready but really
not' route.

For that worker to actually find useful work to do
elsewhere, we'll probably need it to be the case either that the table
is partitioned or the original query will need to involve UNION ALL,
but those are not silly cases to worry about, particularly if we get
native partitioning in 9.7.

One annoyance of this method is a node with one FD and latch-like data
draining. Since we should provide FDs for such nodes, Gather may need
another data-passing channel on the FDs.

And I want to realize early execution of async nodes. This might
require that all types of node return 'not-ready' for the first call,
even if they are async-capable.

2. vectorized execution, by which I mean the ability of a node to
return tuples in batches rather than one by one. Andres has opined
more than once that repeated trips through ExecProcNode defeat the
ability of the CPU to do branch prediction correctly, slowing the
whole system down, and that they also result in poor CPU cache
behavior, since we jump all over the place executing a little bit of
code from each node before moving onto the next rather than running
one bit of code first, and then another later. I think that's
probably right. For example, consider a 5-table join where all of
the joins are implemented as hash tables. If this query plan is going
to be run to completion, it would make much more sense to fetch, say,
100 tuples from the driving scan and then probe for all of those in
the first hash table, and then probe for all of those in the second
hash table, and so on. What we do instead is fetch one tuple and
probe for it in all 5 hash tables, and then repeat. If one of those
hash tables would fit in the CPU cache but all five together will not,
that seems likely to be a lot worse. But even just ignoring the CPU
cache aspect of it for a minute, suppose you want to write a loop to
perform a hash join. The inner loop fetches the next tuple from the
probe table and does a hash lookup. Right now, fetching the next
tuple from the probe table means calling a function which in turn
calls another function which probably calls another function which
probably calls another function and now about 4 layers down we
actually get the next tuple. If the scan returned a batch of tuples
to the hash join, fetching the next tuple from the batch would
probably be 0 or 1 function calls rather than ... more. Admittedly,
you've got to consider the cost of marshaling the batches but I'm
optimistic that there are cycles to be squeezed out here. We might
also want to consider storing batches of tuples in a column-optimized
rather than row-optimized format so that iterating through one or two
attributes across every tuple in the batch touches the minimal number
of cache lines.

Obviously, both of these are big projects that could touch a large
amount of executor code, and there may be other ideas, in addition to
these, which some of you may be thinking about that could also touch a
large amount of executor code. It would be nice to agree on a way
forward that minimizes code churn and maximizes everyone's attempt to
contribute without conflicting with each other. Also, it seems
desirable to enable, as far as possible, incremental development - in
particular, it seems to me that it would be good to pick a design that
doesn't require massive changes to every node all at once. A single
patch that adds some capability to every node in the executor in one
fell swoop is going to be too large to review effectively.

My proposal for how to do this is to make ExecProcNode function as a
backward-compatibility wrapper. For asynchronous execution, a node
might return a not-ready-yet indication, but if that node is called
via ExecProcNode, it means the caller isn't prepared to receive such
an indication, so ExecProcNode will just wait for the node to become
ready and then return the tuple. Similarly, for vectorized execution,
a node might return a bunch of tuples all at once. ExecProcNode will
extract the first one and return it to the caller, and subsequent
calls to ExecProcNode will iterate through the rest of the batch, only
calling the underlying node-specific function when the batch is
exhausted. In this way, code that doesn't know about the new stuff
can continue to work pretty much as it does today. Also, and I think
this is important, nodes don't need the permission of their parent
node to use these new capabilities. They can use them whenever they
wish, without worrying about whether the upper node is prepared to
deal with it. If not, ExecProcNode will paper over the problem. This
seems to me to be a good way to keep the code simple.

Agreed on returning a not-ready state and on wrapping nodes behind the
old-style API, but I suppose Exec* may return a tuple as it does
currently.

For asynchronous execution, I have gone so far as to mock up a bit of
what this might look like. This shouldn't be taken very seriously at
this point, but I'm attaching a few very-much-WIP patches to show the
direction of my line of thinking. Basically, I propose to have
ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
by putting them into a new PlanState member called "result", which is
just a Node * so that we can support multiple types of results,
instead of returning them. There is also a result_ready boolean, so
that a node can return without setting this Boolean to engage
asynchronous behavior. This triggers an "event loop", which
repeatedly waits for FDs chosen by waiting nodes to become readable
and/or writeable and then gives the node a chance to react.
Eventually, the waiting node will stop waiting and have a result
ready, at which point the event loop will give the parent of that node
a chance to run. If that node consequently becomes ready, then its
parent gets a chance to run. Eventually (we hope), the node for which
we're waiting becomes ready, and we can then read a result tuple.

I had thought of almost the same thing, even if only for the Append
node.

With some more work, this seems like it can handle the FDW case, but I
haven't worked out how to make it handle the related parallel query
case. What we want there is to wait not for the readiness of an FD
but rather for some other process involved in the parallel query to
reach a point where it can welcome assistance executing that node. I
don't know exactly what the signaling for that should look like yet -
maybe setting the process latch or something.

Agreed as described above.

By the way, one smaller executor project that I think we should also
look at has to do with this comment in nodeSeqScan.c:

static bool
SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
{
/*
* Note that unlike IndexScan, SeqScan never use keys in heap_beginscan
* (and this is very bad) - so, here we do not check are keys ok or not.
*/
return true;
}

Some quick prototyping by my colleague Dilip Kumar suggests that, in
fact, there are cases where pushing down keys into heap_beginscan()
could be significantly faster. Some care is required here because any
functions we execute as scan keys are run with the buffer locked, so
we had better not run anything very complicated. But doing this for
simple things like integer equality operators seems like it could save
quite a few buffer lock/unlock cycles and some other executor overhead
as well.

The cost of pushing down keys on seqscans seems calculable with a
fairly small amount of computation, so I suppose it is promising.
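
For reference, heap_beginscan() already accepts scan keys;
nodeSeqScan.c just never passes any. Pushing down a simple
integer-equality qual could look roughly like this (sketch only):

	ScanKeyData skey;

	ScanKeyInit(&skey,
				attnum,					/* attribute to test */
				BTEqualStrategyNumber,	/* equality */
				F_INT4EQ,				/* int4 equality function */
				Int32GetDatum(42));		/* constant to compare against */

	scan = heap_beginscan(rel, estate->es_snapshot, 1, &skey);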

Thoughts, ideas, suggestions, etc. very welcome.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

PoC-async-exec-horiguchi-20160510.diff (text/x-patch)
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 2f49268..49e334f 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -120,6 +120,14 @@ enum FdwDirectModifyPrivateIndex
 	FdwDirectModifyPrivateSetProcessed
 };
 
+typedef enum PgFdwFetchState
+{
+	PGFDWFETCH_IDLE,
+	PGFDWFETCH_WAITING,
+	PGFDWFETCH_READY,
+	PGFDWFETCH_EOF
+} PgFdwFetchState;
+
 /*
  * Execution state of a foreign scan using postgres_fdw.
  */
@@ -151,6 +159,8 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		is_async;
+	PgFdwFetchState fetch_status;
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -1248,7 +1258,7 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 */
 	fsstate = (PgFdwScanState *) palloc0(sizeof(PgFdwScanState));
 	node->fdw_state = (void *) fsstate;
-
+	fsstate->is_async = ((eflags & EXEC_FLAG_ASYNC) != 0);
 	/*
 	 * Obtain the foreign server where to connect and user mapping to use for
 	 * connection. For base relations we obtain this information from
@@ -1287,6 +1297,9 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 */
 	fsstate->conn = GetConnection(user, false);
 
+	/* Set a waiting fd to allow asynchronous waiting in upper node */
+	node->ss.ps.fd = PQsocket(fsstate->conn);
+
 	/* Assign a unique ID for my cursor */
 	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
 	fsstate->cursor_exists = false;
@@ -1359,12 +1372,22 @@ postgresIterateForeignScan(ForeignScanState *node)
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
-		if (fsstate->next_tuple >= fsstate->num_tuples)
+		fetch_more_data(node);
+		if (fsstate->fetch_status == PGFDWFETCH_WAITING)
+		{
+			/*
+			 * fetch_more_data just sent the asynchronous query for the next
+			 * batch of output, so ask the caller to visit the next child.
+			 */
+			node->ss.ps.exec_status = EXEC_NOT_READY;
+			return ExecClearTuple(slot);
+		}
+		else if (fsstate->fetch_status == PGFDWFETCH_EOF)
+		{
+			/* fetch_more_data gave no more tuples */
+			node->ss.ps.exec_status = EXEC_EOT;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
@@ -2872,7 +2895,9 @@ fetch_more_data(ForeignScanState *node)
 	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
-
+	PGconn	   *conn = fsstate->conn;
+	char		sql[64];
+	
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
 	 * batch.
@@ -2881,18 +2906,51 @@ fetch_more_data(ForeignScanState *node)
 	MemoryContextReset(fsstate->batch_cxt);
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+	if (fsstate->fetch_status != PGFDWFETCH_WAITING)
+	{
+		/*
+		 * If we reached the final tuple in previous call, no more tuple will
+		 * be fetched this time.
+		 */
+		if (fsstate->eof_reached)
+		{
+			fsstate->fetch_status = PGFDWFETCH_EOF;
+			return;
+		}
+
+		if (!PQsendQuery(conn, sql))
+			pgfdw_report_error(ERROR, NULL, conn, false, sql);
+		fsstate->fetch_status = PGFDWFETCH_WAITING;
+
+		/*
+		 * When currently on a connection running asynchronous fetching, we
+		 * return immediately here.
+		 */
+		if (fsstate->is_async)
+			return;
+	}
+	else
+	{
+		Assert(fsstate->is_async);
+		if (!PQconsumeInput(conn))
+			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
+				
+		if (PQisBusy(conn))
+			return;
+	}
+
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
-		char		sql[64];
-		int			numrows;
 		int			i;
+		int			numrows;
 
-		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
-				 fsstate->fetch_size, fsstate->cursor_number);
+		res = pgfdw_get_result(conn, sql);
+		fsstate->fetch_status = PGFDWFETCH_READY;
 
-		res = pgfdw_exec_query(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
@@ -2923,6 +2981,10 @@ fetch_more_data(ForeignScanState *node)
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
 		fsstate->eof_reached = (numrows < fsstate->fetch_size);
 
+		/* But don't return EOF if any tuple available */
+		if (numrows == 0)
+			fsstate->fetch_status = PGFDWFETCH_EOF;
+
 		PQclear(res);
 		res = NULL;
 	}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ac02304..f76fc94 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1553,6 +1553,8 @@ ExecutePlan(EState *estate,
 	if (use_parallel_mode)
 		EnterParallelMode();
 
+	ExecStartNode(planstate);
+
 	/*
 	 * Loop until we've processed the proper number of tuples from the plan.
 	 */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..590b28e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -383,6 +383,8 @@ ExecProcNode(PlanState *node)
 	if (node->instrument)
 		InstrStartNode(node->instrument);
 
+	node->exec_status = EXEC_READY;
+
 	switch (nodeTag(node))
 	{
 			/*
@@ -540,6 +542,10 @@ ExecProcNode(PlanState *node)
 	if (node->instrument)
 		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
+	if (TupIsNull(result) &&
+		node->exec_status == EXEC_READY)
+		node->exec_status = EXEC_EOT;
+
 	return result;
 }
 
@@ -786,6 +792,30 @@ ExecEndNode(PlanState *node)
 }
 
 /*
+ * ExecStartNode - execute registered early-startup callbacks
+ */
+bool
+ExecStartNode(PlanState *node)
+{
+	if (node == NULL)
+		return false;
+
+	switch (nodeTag(node))
+	{
+	case T_GatherState:
+		return ExecStartGather((GatherState *)node);
+		break;
+	case T_SeqScanState:
+		return ExecStartSeqScan((SeqScanState *)node);
+		break;
+	default:
+		break;	
+	}
+
+	return planstate_tree_walker(node, ExecStartNode, NULL);
+}
+
+/*
  * ExecShutdownNode
  *
  * Give execution nodes a chance to stop asynchronous resource consumption
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 0c1e4a3..95130b0 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2344,6 +2344,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	aggstate = makeNode(AggState);
 	aggstate->ss.ps.plan = (Plan *) node;
 	aggstate->ss.ps.state = estate;
+	aggstate->ss.ps.fd = -1;
 
 	aggstate->aggs = NIL;
 	aggstate->numaggs = 0;
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..004c621 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -121,9 +121,11 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 {
 	AppendState *appendstate = makeNode(AppendState);
 	PlanState **appendplanstates;
+	AppendAsyncState *asyncstates;
 	int			nplans;
 	int			i;
 	ListCell   *lc;
+	bool		has_async_child = false;
 
 	/* check for unsupported flags */
 	Assert(!(eflags & EXEC_FLAG_MARK));
@@ -134,14 +136,22 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	nplans = list_length(node->appendplans);
 
 	appendplanstates = (PlanState **) palloc0(nplans * sizeof(PlanState *));
+	asyncstates =
+		(AppendAsyncState *) palloc0(nplans * sizeof(AppendAsyncState));
+	for (i = 0 ; i < nplans ; i++)
+		asyncstates[i] = ASYNCCHILD_READY;
 
 	/*
 	 * create new AppendState for our append node
 	 */
 	appendstate->ps.plan = (Plan *) node;
 	appendstate->ps.state = estate;
+	appendstate->ps.fd = -1;
 	appendstate->appendplans = appendplanstates;
+	appendstate->async_state = asyncstates;
 	appendstate->as_nplans = nplans;
+	appendstate->evset = CreateWaitEventSet(CurrentMemoryContext,
+											list_length(node->appendplans));
 
 	/*
 	 * Miscellaneous initialization
@@ -165,9 +175,28 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
+		/* always request async execution for children */
+		eflags |= EXEC_FLAG_ASYNC;
 		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+
+		/*
+		 * A child that can scan asynchronously sets a file descriptor to
+		 * poll on during initialization.
+		 */
+		if (appendplanstates[i]->fd >= 0)
+		{
+			AddWaitEventToSet(appendstate->evset, WL_SOCKET_READABLE,
+							  appendplanstates[i]->fd, NULL,
+							  (void *)i);
+			has_async_child = true;
+		}
 		i++;
 	}
+	if (!has_async_child)
+	{
+		FreeWaitEventSet(appendstate->evset);
+		appendstate->evset = NULL;
+	}
 
 	/*
 	 * initialize output tuple type
@@ -193,45 +222,86 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecAppend(AppendState *node)
 {
-	for (;;)
+	int n_notready = 1;
+
+	while (n_notready > 0)
 	{
-		PlanState  *subnode;
 		TupleTableSlot *result;
+		PlanState  *subnode;
+		int i, n;
 
-		/*
-		 * figure out which subplan we are currently processing
-		 */
-		subnode = node->appendplans[node->as_whichplan];
+		/* Scan the children in a round-robin policy. */
+		n_notready = 0;
+		n = node->as_whichplan;
+		for (i = 0 ; i < node->as_nplans ; i++, n++)
+		{
+			if (n >= node->as_nplans) n = 0;
 
-		/*
-		 * get a tuple from the subplan
-		 */
-		result = ExecProcNode(subnode);
+			if (node->async_state[n] != ASYNCCHILD_READY)
+			{
+				if (node->async_state[n] == ASYNCCHILD_NOT_READY)
+					n_notready++;
+				continue;
+			}
+
+			subnode = node->appendplans[n];
+
+			result = ExecProcNode(subnode);
 
-		if (!TupIsNull(result))
-		{
 			/*
 			 * If the subplan gave us something then return it as-is. We do
 			 * NOT make use of the result slot that was set up in
 			 * ExecInitAppend; there's no need for it.
 			 */
-			return result;
+			switch (subnode->exec_status)
+			{
+			case  EXEC_READY:
+				node->as_whichplan = n;
+				return result;
+
+			case  EXEC_NOT_READY:
+				node->async_state[n] = ASYNCCHILD_NOT_READY;
+				n_notready++;
+				break;
+
+			case EXEC_EOT:
+				node->async_state[n] = ASYNCCHILD_DONE;
+				break;
+
+			default:
+				elog(ERROR, "unknown node status: %d", subnode->exec_status);
+			}				
 		}
 
 		/*
-		 * Go on to the "next" subplan in the appropriate direction. If no
-		 * more subplans, return the empty slot set up for us by
-		 * ExecInitAppend.
+		 * If any children are still "not ready" after none of them could
+		 * return a tuple, wait for one of them to become ready.
 		 */
-		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
-		else
-			node->as_whichplan--;
-		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
-
-		/* Else loop back and try to get a tuple from the new subplan */
+		if (n_notready > 0)
+		{
+			WaitEvent occurred_events[5];
+			int nevents;
+			int i;
+
+			nevents = WaitEventSetWait(node->evset, -1, occurred_events, 5);
+			Assert(nevents > 0);
+			for (i = 0 ; i < nevents ; i++)
+			{
+				int plannum = (int)occurred_events[i].user_data;
+				node->async_state[plannum] = ASYNCCHILD_READY;
+			}
+			node->as_whichplan = (int)occurred_events[0].user_data;
+			continue;
+		}
+	}
+
+	/* All children are exhausted.  Free the wait event set if it exists. */
+	if (node->evset)
+	{
+		FreeWaitEventSet(node->evset);
+		node->evset = NULL;
 	}
+	return NULL;
 }
 
 /* ----------------------------------------------------------------
@@ -271,6 +341,7 @@ ExecReScanAppend(AppendState *node)
 	{
 		PlanState  *subnode = node->appendplans[i];
 
+		node->async_state[i] = ASYNCCHILD_READY;
 		/*
 		 * ExecReScan doesn't know about my subplans, so I have to do
 		 * changed-parameter signaling myself.
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index c39d790..3942285 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -63,6 +63,7 @@ ExecInitBitmapAnd(BitmapAnd *node, EState *estate, int eflags)
 	 */
 	bitmapandstate->ps.plan = (Plan *) node;
 	bitmapandstate->ps.state = estate;
+	bitmapandstate->ps.fd = -1;
 	bitmapandstate->bitmapplans = bitmapplanstates;
 	bitmapandstate->nplans = nplans;
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..cc89d56 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -556,6 +556,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate = makeNode(BitmapHeapScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.fd = -1;
 
 	scanstate->tbm = NULL;
 	scanstate->tbmiterator = NULL;
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index a364098..d799292 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -206,6 +206,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
 	indexstate = makeNode(BitmapIndexScanState);
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
+	indexstate->ss.ps.fd = -1;
 
 	/* normally we don't make the result bitmap till runtime */
 	indexstate->biss_result = NULL;
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 7e928eb..5f06ce9 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -64,6 +64,7 @@ ExecInitBitmapOr(BitmapOr *node, EState *estate, int eflags)
 	 */
 	bitmaporstate->ps.plan = (Plan *) node;
 	bitmaporstate->ps.state = estate;
+	bitmaporstate->ps.fd = -1;
 	bitmaporstate->bitmapplans = bitmapplanstates;
 	bitmaporstate->nplans = nplans;
 
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 3c2f684..6f09853 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -191,6 +191,7 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
 	scanstate = makeNode(CteScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.fd = -1;
 	scanstate->eflags = eflags;
 	scanstate->cte_table = NULL;
 	scanstate->eof_cte = false;
diff --git a/src/backend/executor/nodeCustom.c b/src/backend/executor/nodeCustom.c
index 322abca..e825001 100644
--- a/src/backend/executor/nodeCustom.c
+++ b/src/backend/executor/nodeCustom.c
@@ -44,6 +44,7 @@ ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
 	/* fill up fields of ScanState */
 	css->ss.ps.plan = &cscan->scan.plan;
 	css->ss.ps.state = estate;
+	css->ss.ps.fd = -1;
 
 	/* create expression context for node */
 	ExecAssignExprContext(estate, &css->ss.ps);
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 300f947..4079529 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -144,6 +144,7 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	scanstate = makeNode(ForeignScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index a03f6e7..7d508da 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -299,6 +299,7 @@ ExecInitFunctionScan(FunctionScan *node, EState *estate, int eflags)
 	scanstate = makeNode(FunctionScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.fd = -1;
 	scanstate->eflags = eflags;
 
 	/*
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 3834ed6..60a1598 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -46,6 +46,88 @@ static TupleTableSlot *gather_getnext(GatherState *gatherstate);
 static HeapTuple gather_readnext(GatherState *gatherstate);
 static void ExecShutdownGatherWorkers(GatherState *node);
 
+/* ----------------------------------------------------------------
+ *		ExecStartGather
+ *
+ *		The Gather node can benefit from asynchronous execution in most
+ *		cases because of its high startup cost.
+ *		----------------------------------------------------------------
+ */
+bool
+ExecStartGather(GatherState *node)
+{
+	EState	   *estate = node->ps.state;
+	Gather	   *gather = (Gather *) node->ps.plan;
+	TupleTableSlot *fslot = node->funnel_slot;
+	int i;
+
+	/* Don't start if already started or explicitly inhibited by the upper */
+	if (node->initialized || !node->early_start)
+		return false;
+
+	/*
+	 * Initialize the parallel context and workers on first execution. We do
+	 * this on first execution rather than during node initialization, as it
+	 * needs to allocate a large dynamic shared memory segment, so it is
+	 * better to do so only if it is really needed.
+	 */
+
+	/*
+	 * Sometimes we might have to run without parallelism; but if
+	 * parallel mode is active then we can try to fire up some workers.
+	 */
+	if (gather->num_workers > 0 && IsInParallelMode())
+	{
+		ParallelContext *pcxt;
+		bool	got_any_worker = false;
+
+		/* Initialize the workers required to execute Gather node. */
+		if (!node->pei)
+			node->pei = ExecInitParallelPlan(node->ps.lefttree,
+											 estate,
+											 gather->num_workers);
+
+		/*
+		 * Register backend workers. We might not get as many as we
+		 * requested, or indeed any at all.
+		 */
+		pcxt = node->pei->pcxt;
+		LaunchParallelWorkers(pcxt);
+
+		/* Set up tuple queue readers to read the results. */
+		if (pcxt->nworkers > 0)
+		{
+			node->nreaders = 0;
+			node->reader =
+				palloc(pcxt->nworkers * sizeof(TupleQueueReader *));
+
+			for (i = 0; i < pcxt->nworkers; ++i)
+			{
+				if (pcxt->worker[i].bgwhandle == NULL)
+					continue;
+
+				shm_mq_set_handle(node->pei->tqueue[i],
+								  pcxt->worker[i].bgwhandle);
+				node->reader[node->nreaders++] =
+					CreateTupleQueueReader(node->pei->tqueue[i],
+										   fslot->tts_tupleDescriptor);
+				got_any_worker = true;
+			}
+		}
+
+		/* No workers?  Then never mind. */
+		if (!got_any_worker)
+			ExecShutdownGatherWorkers(node);
+	}
+
+	/* Run plan locally if no workers or not single-copy. */
+	node->need_to_scan_locally = (node->reader == NULL)
+		|| !gather->single_copy;
+
+	node->early_start = false;
+	node->initialized = true;
+	return false;
+}
 
 /* ----------------------------------------------------------------
  *		ExecInitGather
@@ -58,6 +140,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	Plan	   *outerNode;
 	bool		hasoid;
 	TupleDesc	tupDesc;
+	int			child_eflags;
 
 	/* Gather node doesn't have innerPlan node. */
 	Assert(innerPlan(node) == NULL);
@@ -68,6 +151,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	gatherstate = makeNode(GatherState);
 	gatherstate->ps.plan = (Plan *) node;
 	gatherstate->ps.state = estate;
+	gatherstate->ps.fd = -1;
 	gatherstate->need_to_scan_locally = !node->single_copy;
 
 	/*
@@ -97,7 +181,12 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	 * now initialize outer plan
 	 */
 	outerNode = outerPlan(node);
-	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
+	/*
+	 * The outer plan is executed in other processes, so don't start it
+	 * asynchronously in this process.
+	 */
+	child_eflags = eflags & ~EXEC_FLAG_ASYNC;
+	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, child_eflags);
 
 	gatherstate->ps.ps_TupFromTlist = false;
 
@@ -115,6 +204,16 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	tupDesc = ExecTypeFromTL(outerNode->targetlist, hasoid);
 	ExecSetSlotDescriptor(gatherstate->funnel_slot, tupDesc);
 
+	/*
+	 * Register early start for this node when asynchronous execution is
+	 * requested. Backend workers need to allocate a large dynamic shared
+	 * memory segment, so ideally the decision to start early would take
+	 * that cost into account, but we omit that aspect for
+	 * now.
+	 */
+	if (eflags & EXEC_FLAG_ASYNC)
+		gatherstate->early_start = true;
+
 	return gatherstate;
 }
 
@@ -128,74 +227,14 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 TupleTableSlot *
 ExecGather(GatherState *node)
 {
-	TupleTableSlot *fslot = node->funnel_slot;
-	int			i;
 	TupleTableSlot *slot;
 	TupleTableSlot *resultSlot;
 	ExprDoneCond isDone;
 	ExprContext *econtext;
 
-	/*
-	 * Initialize the parallel context and workers on first execution. We do
-	 * this on first execution rather than during node initialization, as it
-	 * needs to allocate large dynamic segment, so it is better to do if it
-	 * is really needed.
-	 */
+	/* Initialize workers if not yet. */
 	if (!node->initialized)
-	{
-		EState	   *estate = node->ps.state;
-		Gather	   *gather = (Gather *) node->ps.plan;
-
-		/*
-		 * Sometimes we might have to run without parallelism; but if
-		 * parallel mode is active then we can try to fire up some workers.
-		 */
-		if (gather->num_workers > 0 && IsInParallelMode())
-		{
-			ParallelContext *pcxt;
-
-			/* Initialize the workers required to execute Gather node. */
-			if (!node->pei)
-				node->pei = ExecInitParallelPlan(node->ps.lefttree,
-												 estate,
-												 gather->num_workers);
-
-			/*
-			 * Register backend workers. We might not get as many as we
-			 * requested, or indeed any at all.
-			 */
-			pcxt = node->pei->pcxt;
-			LaunchParallelWorkers(pcxt);
-			node->nworkers_launched = pcxt->nworkers_launched;
-
-			/* Set up tuple queue readers to read the results. */
-			if (pcxt->nworkers_launched > 0)
-			{
-				node->nreaders = 0;
-				node->reader =
-					palloc(pcxt->nworkers_launched * sizeof(TupleQueueReader *));
-
-				for (i = 0; i < pcxt->nworkers_launched; ++i)
-				{
-					shm_mq_set_handle(node->pei->tqueue[i],
-									  pcxt->worker[i].bgwhandle);
-					node->reader[node->nreaders++] =
-						CreateTupleQueueReader(node->pei->tqueue[i],
-											   fslot->tts_tupleDescriptor);
-				}
-			}
-			else
-			{
-				/* No workers?  Then never mind. */
-				ExecShutdownGatherWorkers(node);
-			}
-		}
-
-		/* Run plan locally if no workers or not single-copy. */
-		node->need_to_scan_locally = (node->reader == NULL)
-			|| !gather->single_copy;
-		node->initialized = true;
-	}
+		ExecStartGather(node);
 
 	/*
 	 * Check to see if we're still projecting out tuples from a previous scan
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index dcf5175..33093e7 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -207,6 +207,7 @@ ExecInitGroup(Group *node, EState *estate, int eflags)
 	grpstate = makeNode(GroupState);
 	grpstate->ss.ps.plan = (Plan *) node;
 	grpstate->ss.ps.state = estate;
+	grpstate->ss.ps.fd = -1;
 	grpstate->grp_done = FALSE;
 
 	/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 9ed09a7..f62b556 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -172,6 +172,7 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	hashstate = makeNode(HashState);
 	hashstate->ps.plan = (Plan *) node;
 	hashstate->ps.state = estate;
+	hashstate->ps.fd = -1;
 	hashstate->hashtable = NULL;
 	hashstate->hashkeys = NIL;	/* will be set by parent HashJoin */
 
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 369e666..ec54570 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -451,6 +451,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate = makeNode(HashJoinState);
 	hjstate->js.ps.plan = (Plan *) node;
 	hjstate->js.ps.state = estate;
+	hjstate->js.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..94b0193 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -403,6 +403,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	indexstate = makeNode(IndexOnlyScanState);
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
+	indexstate->ss.ps.fd = -1;
 	indexstate->ioss_HeapFetches = 0;
 
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index bf16cb1..1beee6f 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -829,6 +829,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	indexstate = makeNode(IndexScanState);
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
+	indexstate->ss.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index faf32e1..6baf1c0 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -384,6 +384,7 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	limitstate = makeNode(LimitState);
 	limitstate->ps.plan = (Plan *) node;
 	limitstate->ps.state = estate;
+	limitstate->ps.fd = -1;
 
 	limitstate->lstate = LIMIT_INITIAL;
 
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 4ebcaff..42b2ff5 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -361,6 +361,7 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
 	lrstate = makeNode(LockRowsState);
 	lrstate->ps.plan = (Plan *) node;
 	lrstate->ps.state = estate;
+	lrstate->ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9ab03f3..db8279a 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -171,6 +171,7 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
 	matstate = makeNode(MaterialState);
 	matstate->ss.ps.plan = (Plan *) node;
 	matstate->ss.ps.state = estate;
+	matstate->ss.ps.fd = -1;
 
 	/*
 	 * We must have a tuplestore buffering the subplan output to do backward
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e271927..c5323d7 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -83,6 +83,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	 */
 	mergestate->ps.plan = (Plan *) node;
 	mergestate->ps.state = estate;
+	mergestate->ps.fd = -1;
 	mergestate->mergeplans = mergeplanstates;
 	mergestate->ms_nplans = nplans;
 
@@ -112,6 +113,9 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
+		/* always request async execution for now */
+		eflags = eflags | EXEC_FLAG_ASYNC;
+
 		mergeplanstates[i] = ExecInitNode(initNode, estate, eflags);
 		i++;
 	}
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 6db09b8..27ac84e 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1485,6 +1485,7 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	mergestate = makeNode(MergeJoinState);
 	mergestate->js.ps.plan = (Plan *) node;
 	mergestate->js.ps.state = estate;
+	mergestate->js.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e62c8aa..78df2e4 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1561,6 +1561,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	mtstate->ps.plan = (Plan *) node;
 	mtstate->ps.state = estate;
 	mtstate->ps.targetlist = NIL;		/* not actually used */
+	mtstate->ps.fd = -1;
 
 	mtstate->operation = operation;
 	mtstate->canSetTag = node->canSetTag;
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 555fa09..c262d7f 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -309,6 +309,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	nlstate = makeNode(NestLoopState);
 	nlstate->js.ps.plan = (Plan *) node;
 	nlstate->js.ps.state = estate;
+	nlstate->js.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
@@ -340,11 +341,24 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 * inner child, because it will always be rescanned with fresh parameter
 	 * values.
 	 */
+
+	/*
+	 * Asynchronous execution of the outer plan is beneficial if this join
+	 * itself is requested to run asynchronously.
+	 */
 	outerPlanState(nlstate) = ExecInitNode(outerPlan(node), estate, eflags);
 	if (node->nestParams == NIL)
 		eflags |= EXEC_FLAG_REWIND;
 	else
 		eflags &= ~EXEC_FLAG_REWIND;
+
+	/*
+	 * Asynchronous execution of the inner plan is inhibited if it is
+	 * parameterized by the outer.
+	 */
+	if (list_length(node->nestParams) > 0)
+		eflags &= ~EXEC_FLAG_ASYNC;
+
 	innerPlanState(nlstate) = ExecInitNode(innerPlan(node), estate, eflags);
 
 	/*
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index e76405a..48a70cb 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -176,6 +176,7 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	rustate = makeNode(RecursiveUnionState);
 	rustate->ps.plan = (Plan *) node;
 	rustate->ps.state = estate;
+	rustate->ps.fd = -1;
 
 	rustate->eqfunctions = NULL;
 	rustate->hashfunctions = NULL;
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 4007b76..027b64e 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -217,6 +217,7 @@ ExecInitResult(Result *node, EState *estate, int eflags)
 	resstate = makeNode(ResultState);
 	resstate->ps.plan = (Plan *) node;
 	resstate->ps.state = estate;
+	resstate->ps.fd = -1;
 
 	resstate->rs_done = false;
 	resstate->rs_checkqual = (node->resconstantqual == NULL) ? false : true;
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 9ce7c02..a670e77 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -152,6 +152,7 @@ ExecInitSampleScan(SampleScan *node, EState *estate, int eflags)
 	scanstate = makeNode(SampleScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index f12921d..86a3015 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -39,6 +39,20 @@ static TupleTableSlot *SeqNext(SeqScanState *node);
  * ----------------------------------------------------------------
  */
 
+bool
+ExecStartSeqScan(SeqScanState *node)
+{
+	if (node->early_start)
+	{
+		ereport(LOG,
+				(errmsg("dummy_async_cb is called for %p@ExecStartSeqScan", node),
+				 errhidestmt(true)));
+		node->early_start = false;
+	}
+
+	return false;
+}
+
 /* ----------------------------------------------------------------
  *		SeqNext
  *
@@ -177,6 +191,7 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 	scanstate = makeNode(SeqScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
@@ -214,6 +229,10 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 	ExecAssignResultTypeFromTL(&scanstate->ss.ps);
 	ExecAssignScanProjectionInfo(&scanstate->ss);
 
+	/* Do early start when requested */
+	if (eflags & EXEC_FLAG_ASYNC)
+		scanstate->early_start = true;
+
 	return scanstate;
 }
 
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 2d81d46..8eafd91 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -487,6 +487,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 	setopstate = makeNode(SetOpState);
 	setopstate->ps.plan = (Plan *) node;
 	setopstate->ps.state = estate;
+	setopstate->ps.fd = -1;
 
 	setopstate->eqfunctions = NULL;
 	setopstate->hashfunctions = NULL;
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index a34dcc5..f28dc2d 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -162,6 +162,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	sortstate = makeNode(SortState);
 	sortstate->ss.ps.plan = (Plan *) node;
 	sortstate->ss.ps.state = estate;
+	sortstate->ss.ps.fd = -1;
 
 	/*
 	 * We must have random access to the sort output to do backward scan or
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 0304b15..c2b9bb0 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -117,6 +117,7 @@ ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags)
 	subquerystate = makeNode(SubqueryScanState);
 	subquerystate->ss.ps.plan = (Plan *) node;
 	subquerystate->ss.ps.state = estate;
+	subquerystate->ss.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 2604103..41d69c3 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -461,6 +461,7 @@ ExecInitTidScan(TidScan *node, EState *estate, int eflags)
 	tidstate = makeNode(TidScanState);
 	tidstate->ss.ps.plan = (Plan *) node;
 	tidstate->ss.ps.state = estate;
+	tidstate->ss.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 4caae34..56c21e8 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -122,6 +122,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	uniquestate = makeNode(UniqueState);
 	uniquestate->ps.plan = (Plan *) node;
 	uniquestate->ps.state = estate;
+	uniquestate->ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeValuesscan.c b/src/backend/executor/nodeValuesscan.c
index 2c4bd9c..2ec3ed7 100644
--- a/src/backend/executor/nodeValuesscan.c
+++ b/src/backend/executor/nodeValuesscan.c
@@ -205,6 +205,7 @@ ExecInitValuesScan(ValuesScan *node, EState *estate, int eflags)
 	scanstate = makeNode(ValuesScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.fd = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index f06eebe..bc5b9ce 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1787,6 +1787,7 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	winstate = makeNode(WindowAggState);
 	winstate->ss.ps.plan = (Plan *) node;
 	winstate->ss.ps.state = estate;
+	winstate->ss.ps.fd = -1;
 
 	/*
 	 * Create expression contexts.  We need two, one for per-input-tuple
diff --git a/src/backend/executor/nodeWorktablescan.c b/src/backend/executor/nodeWorktablescan.c
index cfed6e6..230c849 100644
--- a/src/backend/executor/nodeWorktablescan.c
+++ b/src/backend/executor/nodeWorktablescan.c
@@ -144,6 +144,7 @@ ExecInitWorkTableScan(WorkTableScan *node, EState *estate, int eflags)
 	scanstate = makeNode(WorkTableScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.fd = -1;
 	scanstate->rustate = NULL;	/* we'll set this later */
 
 	/*
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 44fac27..de78d04 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -62,6 +62,7 @@
 #define EXEC_FLAG_WITH_OIDS		0x0020	/* force OIDs in returned tuples */
 #define EXEC_FLAG_WITHOUT_OIDS	0x0040	/* force no OIDs in returned tuples */
 #define EXEC_FLAG_WITH_NO_DATA	0x0080	/* rel scannability doesn't matter */
+#define EXEC_FLAG_ASYNC			0x0100	/* request asynchronous execution */
 
 
 /*
@@ -224,6 +225,7 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
 extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
+extern bool ExecStartNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
diff --git a/src/include/executor/nodeGather.h b/src/include/executor/nodeGather.h
index f76d9be..0a48a03 100644
--- a/src/include/executor/nodeGather.h
+++ b/src/include/executor/nodeGather.h
@@ -18,6 +18,7 @@
 
 extern GatherState *ExecInitGather(Gather *node, EState *estate, int eflags);
 extern TupleTableSlot *ExecGather(GatherState *node);
+extern bool ExecStartGather(GatherState *node);
 extern void ExecEndGather(GatherState *node);
 extern void ExecShutdownGather(GatherState *node);
 extern void ExecReScanGather(GatherState *node);
diff --git a/src/include/executor/nodeSeqscan.h b/src/include/executor/nodeSeqscan.h
index f2e61ff..daf54ac 100644
--- a/src/include/executor/nodeSeqscan.h
+++ b/src/include/executor/nodeSeqscan.h
@@ -19,6 +19,7 @@
 
 extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags);
 extern TupleTableSlot *ExecSeqScan(SeqScanState *node);
+extern bool ExecStartSeqScan(SeqScanState *node);
 extern void ExecEndSeqScan(SeqScanState *node);
 extern void ExecReScanSeqScan(SeqScanState *node);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ee4e189..205a2c8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -20,6 +20,7 @@
 #include "lib/pairingheap.h"
 #include "nodes/params.h"
 #include "nodes/plannodes.h"
+#include "storage/latch.h"
 #include "utils/reltrigger.h"
 #include "utils/sortsupport.h"
 #include "utils/tuplestore.h"
@@ -345,6 +346,14 @@ typedef struct ResultRelInfo
 	List	   *ri_onConflictSetWhere;
 } ResultRelInfo;
 
+/* Enum for async awareness */
+typedef enum NodeStatus
+{
+	EXEC_NOT_READY,
+	EXEC_READY,
+	EXEC_EOT
+} NodeStatus;
+
 /* ----------------
  *	  EState information
  *
@@ -1059,6 +1068,9 @@ typedef struct PlanState
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
 	bool		ps_TupFromTlist;/* state flag for processing set-valued
 								 * functions in targetlist */
+
+	NodeStatus	exec_status;
+	int			fd;
 } PlanState;
 
 /* ----------------
@@ -1138,6 +1150,14 @@ typedef struct ModifyTableState
 										 * target */
 } ModifyTableState;
 
+
+typedef enum AppendAsyncState
+{
+	ASYNCCHILD_READY,
+	ASYNCCHILD_NOT_READY,
+	ASYNCCHILD_DONE
+} AppendAsyncState;
+
 /* ----------------
  *	 AppendState information
  *
@@ -1149,8 +1169,10 @@ typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
+	AppendAsyncState   *async_state;
 	int			as_nplans;
 	int			as_whichplan;
+	WaitEventSet *evset;
 } AppendState;
 
 /* ----------------
@@ -1259,6 +1281,7 @@ typedef struct SeqScanState
 {
 	ScanState	ss;				/* its first field is NodeTag */
 	Size		pscan_len;		/* size of parallel heap scan descriptor */
+	bool		early_start;
 } SeqScanState;
 
 /* ----------------
@@ -1952,6 +1975,7 @@ typedef struct UniqueState
 typedef struct GatherState
 {
 	PlanState	ps;				/* its first field is NodeTag */
+	bool		early_start;
 	bool		initialized;
 	struct ParallelExecutorInfo *pei;
 	int			nreaders;
#12Rajeev rastogi
rajeev.rastogi@huawei.com
In reply to: Robert Haas (#1)
Re: asynchronous and vectorized execution

On 09 May 2016 23:04, Robert Haas wrote:

2. vectorized execution, by which I mean the ability of a node to return
tuples in batches rather than one by one. Andres has opined more than
once that repeated trips through ExecProcNode defeat the ability of the
CPU to do branch prediction correctly, slowing the whole system down,
and that they also result in poor CPU cache behavior, since we jump all
over the place executing a little bit of code from each node before
moving onto the next rather than running one bit of code first, and then
another later. I think that's
probably right. For example, consider a 5-table join where all of
the joins are implemented as hash tables. If this query plan is going
to be run to completion, it would make much more sense to fetch, say,
100 tuples from the driving scan and then probe for all of those in the
first hash table, and then probe for all of those in the second hash
table, and so on. What we do instead is fetch one tuple and probe for
it in all 5 hash tables, and then repeat. If one of those hash tables
would fit in the CPU cache but all five together will not,
that seems likely to be a lot worse. But even just ignoring the CPU
cache aspect of it for a minute, suppose you want to write a loop to
perform a hash join. The inner loop fetches the next tuple from the
probe table and does a hash lookup. Right now, fetching the next tuple
from the probe table means calling a function which in turn calls
another function which probably calls another function which probably
calls another function and now about 4 layers down we actually get the
next tuple. If the scan returned a batch of tuples to the hash join,
fetching the next tuple from the batch would probably be 0 or 1 function
calls rather than ... more. Admittedly, you've got to consider the cost
of marshaling the batches but I'm optimistic that there are cycles to be
squeezed out here. We might also want to consider storing batches of
tuples in a column-optimized rather than row-optimized format so that
iterating through one or two attributes across every tuple in the batch
touches the minimal number of cache lines.

This sounds like a really great idea for improving performance.
I would like to share my thoughts based on our research work in a similar area (much of it may be just as you have mentioned).
Our goals with this work were to:
1. Make the processing data-centric instead of operator-centric.
2. Instead of pulling each tuple from the immediate child operator, let an operator push tuples to its parent. Pushing can continue until we reach an operator that cannot proceed without results from another operator.
3. Through the above two points, achieve better data locality.

e.g. we had done some quick prototyping (take it just as a thought provoker) as mentioned below:
Query: select * from tbl1, tbl2, tbl3 where tbl1.a=tbl2.b and tbl2.b=tbl3.c;
For hash join:

    for each tuple t2 of tbl2:
        insert t2 into a hash table on tbl1.a = tbl2.b

    for each tuple t3 of tbl3:
        insert t3 into a hash table on tbl2.b = tbl3.c

    for each tuple t1 of tbl1:
        probe the hash table on tbl1.a = tbl2.b
        probe the hash table on tbl2.b = tbl3.c
        output t1*t2*t3

Of course, at each level, if there is any additional qual for the table, the same can be applied.
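
A rough C-like sketch of that pipeline (hash_insert, hash_probe, and emit are invented helpers; this only shows the shape of the data-centric loop):

	/* build both hash tables first */
	for (t2 = first(tbl2); t2 != NULL; t2 = next(tbl2))
		hash_insert(h12, t2);			/* keyed on tbl2.b */
	for (t3 = first(tbl3); t3 != NULL; t3 = next(tbl3))
		hash_insert(h23, t3);			/* keyed on tbl3.c */

	/* then one tight loop drives each t1 through both probes */
	for (t1 = first(tbl1); t1 != NULL; t1 = next(tbl1))
	{
		t2 = hash_probe(h12, t1->a);	/* tbl1.a = tbl2.b */
		if (t2 == NULL)
			continue;
		t3 = hash_probe(h23, t2->b);	/* tbl2.b = tbl3.c */
		if (t3 == NULL)
			continue;
		emit(t1, t2, t3);				/* output t1*t2*t3 */
	}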

Similarly for a nested loop join, the plan tree can be processed as a post-order traversal of the tree:

    scan the first (leftmost) relation, store all its tuples --> outer
    loop over the remaining scanned relations, starting from the second
    (possibly taking only part of the outer tuples at a time):
        scan the current relation
        for each tuple, match it against all tuples of the outer result and build the combined tuples
        save all satisfying combined tuples --> outer

The result we got was really impressive.

There is a paper by Thomas Neumann on this idea: http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

Note: VitesseDB has also implemented this approach for Hash Join along with compilation, and their results are really impressive.

Thanks and Regards,
Kumar Rajeev Rastogi.
http://rajeevrastogi.blogspot.in/


#13Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#3)
Re: asynchronous and vectorized execution

On Mon, May 9, 2016 at 8:34 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:

It's interesting that you mention this. We identified this as a pain
point during our work on column stores last year. Simply passing
single tuples around the executor is really unfriendly towards L1
instruction cache, plus also the points you mention about L3 cache and
hash tables and tuple stores. I really think that we're likely to see
significant gains by processing >1 tuple at a time, so this topic very
much interests me.

Cool. I hope we can work together on it, and with anyone else who is
interested.

When we start multiplying those increases with the increases from
something like parallel query, then we're starting to see very nice
gains in performance.

Check.

Alvaro, Tomas and I had been discussing this and late last year I did
look into what would be required to allow this to happen in Postgres.
Basically there's 2 sub-projects, I'll describe what I've managed to
learn so far about each, and the rough plan that I have to implement
them:

1. Batch Execution:

a. Modify ScanAPI to allow batch tuple fetching in predefined batch sizes.
b. Modify TupleTableSlot to allow > 1 tuple to be stored. Add flag to
indicate if the struct contains a single or a multiple tuples.
Multiple tuples may need to be deformed in a non-lazy fashion in order
to prevent too many buffers from having to be pinned at once. Tuples
will be deformed into arrays of each column rather than arrays for
each tuple (this part is important to support the next sub-project)
c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
process a batch TupleTableSlot. This will require some tight loop to
aggregate the entire TupleTableSlot at once before returning.
d. Add function in execAmi.c which returns true or false depending on
if the node supports batch TupleTableSlots or not.
e. At executor startup determine if the entire plan tree supports
batch TupleTableSlots, if so enable batch scan mode.

I'm wondering if we should instead have a whole new kind of node, a
TupleTableVector, say. Things that want to work one tuple at a time
can continue to do so with no additional overhead. Things that want
to return batches can do it via this new result type.
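
Purely as a sketch of what I mean, and with none of the fields settled,
something like:

	/*
	 * Hypothetical batch result type.  Column-major layout, so a tight
	 * loop over one attribute walks consecutive memory; per-column
	 * null bitmaps would live alongside the value arrays.
	 */
	typedef struct TupleTableVector
	{
		NodeTag		type;
		TupleDesc	tupdesc;	/* descriptor shared by the whole batch */
		int			ntuples;	/* number of tuples currently stored */
		Datum	   *columns[FLEXIBLE_ARRAY_MEMBER];	/* one array per column */
	} TupleTableVector;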

node types, which *may* not be all that difficult. I'm also assuming
that batch mode (in all cases apart from queries with LIMIT or
cursors) will always be faster than tuple-at-a-time, so requires no
costings from the planner.

I definitely agree that we need to consider cases with and without
LIMIT separately, but there's more than one way to get a LIMIT. For
example, a subquery returns only one row; a semijoin returns only one
row. I don't think you are going to escape planner considerations.

Nested Loop Semi Join
-> Seq Scan
-> Index Scan on dont_batch_here

2. Vector processing

(I admit that I've given this part much less thought so far, but
here's what I have in mind)

This depends on batch execution, and is intended to allow the executor
to perform function calls to an entire batch at once, rather than
tuple-at-a-time. For example, let's take the following example;

SELECT a+b FROM t;

here (as of now) we'd scan "t" one row at a time and perform a+b after
having deformed enough of the tuple to do that. We'd then go and get
another Tuple from the scan node and repeat until the scan gave us no
more Tuples.

With batch execution we'd fetch multiple Tuples from the scan and we'd
then perform the call to say int4_pl() multiple times, which still
kinda sucks as it means calling int4_pl() possibly millions of times
(once per tuple). The vector mode here would require that we modify
pg_operator to add a vector function for each operator so that we can
call the function passing in an array of Datums and a length to have
SIMD operations perform the addition, so we'd call something like
int4_pl_vector() only once per batch of tuples allowing the CPU to
perform SIMD operations on those datum arrays. This could be done in
an incremental way as the code could just callback on the standard
function in cases where a vectorised version of it is not available.
Thought is needed here about when exactly this decision is made as the
user may not have permissions to execute the vector function, so it
can't simply be a run time check. These functions would simply return
another vector of the results. Aggregates could be given a vector
transition function, where something like COUNT(*)'s vector_transfn
would simply do current_count += vector_length;
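
For illustration, a hypothetical int4_pl_vector could be as simple as
the following; the calling convention is invented and overflow checking
is elided, but it shows how one call per batch gives the compiler a
loop it can turn into SIMD instructions:

	static void
	int4_pl_vector(const int32 *a, const int32 *b, int32 *result, int n)
	{
		int			i;

		/* one function call per batch instead of one per tuple */
		for (i = 0; i < n; i++)
			result[i] = a[i] + b[i];
	}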

This project does appear to require that we bloat the code with 100's
of vector versions of each function. I'm not quite sure if there's a
better way to handle this. The problem is that the fmgr is pretty much
a barrier to SIMD operations, and this was the only idea that I've had
so far about breaking through that barrier. So further ideas here are
very welcome.

I don't have any at the moment, but I'm not keen on hundreds of new
vector functions that can all have bugs or behavior differences versus
the unvectorized versions of the same code. That's a substantial tax
on future development. I think it's important to understand what
sorts of queries we are targeting here. KaiGai's GPU-acceleration
stuff does great on queries with complex WHERE clauses, but most
people don't care not only because it's out-of-core but because who
actually looks for the records where (a + b) % c > (d + e) * f / g?
This seems like it has the same issue. If we can speed up common
queries people are actually likely to run, OK, that's interesting.

By the way, I think KaiGai's GPU-acceleration stuff points to another
pitfall here. There's other stuff somebody might legitimately want to
do that requires another copy of each function. For example, run-time
code generation likely needs that (a function to tell the code
generator what to generate for each of our functions), and
GPU-acceleration probably does, too. If fixing a bug in numeric_lt
requires changing not only the regular version and the vectorized
version but also the GPU-accelerated version and the codegen version,
Tom and Dean are going to kill us. And justifiably so! Granted,
nobody is proposing those other features in core right now, but
they're totally reasonable things to want to do.

I suspect the number of queries that are being hurt by fmgr overhead
is really large, and I think it would be nice to attack that problem
more directly. It's a bit hard to discuss what's worthwhile in the
abstract, without performance numbers, but when you vectorize, how
much is the benefit from using SIMD instructions and how much is the
benefit from just not going through the fmgr every time?

The idea here is that these 2 projects help pave the way to bring
columnar storage into PostgreSQL. Without these we're unlikely to get
much benefit from columnar storage as we'd still be stuck processing
rows one at a time. Adding columnar storage on top of the above
should further increase performance as we can skip the tuple-deform
step and pull columnar array/vectors directly into a TupleTableSlot,
although some trickery would be involved here when the scan has keys.

I'm a bit mystified by this. It seems to me that you could push down
the optimizable quals into the AM, just like what index AMs do for
Index Quals and what postgres_fdw does for pushdown-safe quals. Then
those quals get executed on the optimized representation, and you only
have to fill TupleTableSlots for the surviving tuples. AFAICS,
vectorizing the core executor only helps if you want to keep the data
in vectorized form for longer, e.g. to somehow optimize joins or aggs,
or if the data starts out in row-oriented form and we convert it to
columnar form before doing vector ops. Evidently I'm confused.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#14Robert Haas
robertmhaas@gmail.com
In reply to: Kouhei Kaigai (#4)
Re: asynchronous and vectorized execution

On Mon, May 9, 2016 at 9:38 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

Is the parallel-aware Append node sufficient to run multiple nodes
asynchronously? (Sorry, I haven't had enough time to code the feature,
even though we discussed it before.)

It's tempting to think that parallel query and asynchronous query are
the same thing, but I think that they are actually quite different.
Parallel query involves using multiple processes to service a query.
Asynchronous query involves using each individual process as
efficiently as possible by not having it block any more than
necessary. You can want these things together or separately. For
example, consider this query plan:

Append
-> ForeignScan
-> ForeignScan

Here, you do not want parallel query; the queries must both be
launched by the user backend, not some worker process, else you will
not get the right transaction semantics. The parallel-aware Append
node we talked about before does not help here. On the other hand,
consider this:

Append
-> Seq Scan
Filter: lots_of_cpu()
-> Seq Scan
Filter: lots_of_cpu()

Here, asynchronous query is of no help, but parallel query is great.
Now consider this hypothetical plan:

Gather
-> Hash Join
-> Parallel Bitmap Heap Scan
-> Bitmap Index Scan
-> Parallel Hash
-> Parallel Seq Scan

Let's assume that the bitmap *heap* scan can be performed in parallel
but the bitmap *index* scan can't. That's pretty reasonable for a
first cut, actually, because the index accesses are probably touching
much less data, and we're doing little CPU work for each index page -
any delay here is likely to be I/O.

So in that world what you want is for the first worker to begin
performing the bitmap index scan and building a shared TIDBitmap for
that the workers can use to scan the heap. The other workers,
meanwhile, could begin building the shared hash table (which is what I
intend to denote by saying that it's a *Parallel* Hash). If the
process building the bitmap finishes before the hash table is built,
it can help build the hash table as well. Once both operations are
done, each process can scan a subset of the rows from the bitmap heap
scan and do the hash table probes for just those rows. To make all of
this work, you need both *parallel* query, so that you have workers,
and also *asynchronous* query, so that workers which see that the
bitmap index scan is in progress don't get stuck waiting for it but
can look around for other work.

In the above example, the scan on the foreign table has a longer lead
time than the local scan. If Append can map each child node onto an
individual worker, the local scan worker begins returning tuples first,
and mixed tuples will be returned eventually.

This is getting at an issue I'm somewhat worried about, which is
scheduling. In my prototype, we only check for events if there are no
nodes ready for the CPU now, but sometimes that might be a loser. One
probably needs to check for events periodically even when there are
still nodes waiting for the CPU, but "how often?" seems like an
unanswerable question.

However, process-internal asynchronous execution may also be beneficial
in cases where the cost of shm_mq is not negligible (e.g., no scan
qualifiers are given to the worker process). I think it allows
prefetching to be implemented very naturally.

Yes.

My proposal for how to do this is to make ExecProcNode function as a
backward-compatibility wrapper. For asynchronous execution, a node
might return a not-ready-yet indication, but if that node is called
via ExecProcNode, it means the caller isn't prepared to receive such
an indication, so ExecProcNode will just wait for the node to become
ready and then return the tuple.
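
In rough pseudocode (ExecAsyncProcNode, ExecAsyncWait, and the
result_ready field are hypothetical names for the pieces described
above):

	TupleTableSlot *
	ExecProcNode(PlanState *node)
	{
		TupleTableSlot *slot;

		slot = ExecAsyncProcNode(node);
		while (!node->result_ready)
		{
			/* caller can't handle not-ready, so block on the node's events */
			ExecAsyncWait(node);
			slot = ExecAsyncProcNode(node);
		}
		return slot;
	}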

Backward compatibility is good. In addition, a child node may want to
know the context in which it is called, and may want to switch its
behavior according to the caller's expectation. For example, it may be
beneficial if SeqScan does more aggressive prefetching during
asynchronous execution.

Maybe, but I'm a bit doubtful. I'm not seeing a lot of advantage in
that sort of thing, and it will make the code a lot more complicated.

Also, can we consider during the planning stage which data format will
be returned from the child node? It affects the cost of inter-node data
exchange. If a parent-node/child-node pair supports a special data
format (as the existing HashJoin and Hash do), that should be a
discount factor in the cost estimation.

I'm not sure. The costing aspects of this need a lot more thought
than I have given them so far.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#15Robert Haas
robertmhaas@gmail.com
In reply to: konstantin knizhnik (#10)
Re: asynchronous and vectorized execution

On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik
<k.knizhnik@postgrespro.ru> wrote:

What's wrong with a worker being blocked? You can just have more workers
(more than CPU cores) to let the others continue to do useful work.

Not really. The workers are all running the same plan, so they'll all
make the same decision about which node needs to be executed next. If
that node can't accommodate multiple processes trying to execute it at
the same time, it will have to block all of them but the first one.
Adding more processes just increases the number of processes sitting
around doing nothing.

But there is some research, for example:

http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

showing that the same or an even better effect can be achieved by
generating native code for the query execution plan (which is not so
difficult now, thanks to LLVM). It eliminates interpretation overhead and
increases cache locality. I got similar results in my own experiments
with accelerating SparkSQL. Instead of native code generation I used
conversion of query plans to C code and experimented with different data
representations. A "horizontal model" with loading of columns on demand
showed better performance than a columnar store.

Yes, I think this approach should also be considered.

At this moment (February) they have implemented translation of only a few
PostgreSQL operators used by ExecQual and do not support aggregates.
They get about a 2x speed increase on synthetic queries and a 25%
increase on TPC-H Q1 (for Q1 the most critical part is generating native
code for aggregates, because ExecQual itself takes only 6% of the time
for this query). Actually these 25% for Q1 were achieved not by dynamic
code generation, but by switching from a PULL to a PUSH model in the
executor. That seems to be yet another interesting PostgreSQL executor
transformation. As far as I know, they are going to publish the result
of their work as open source...

Interesting. You may notice that in "asynchronous mode" my prototype
works using a push model of sorts. Maybe that should be taken
further.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#16Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Robert Haas (#15)
Re: asynchronous and vectorized execution

On 05/10/2016 08:26 PM, Robert Haas wrote:

On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik
<k.knizhnik@postgrespro.ru> wrote:

What's wrong with a worker being blocked? You can just have more workers
(more than CPU cores) to let the others continue to do useful work.

Not really. The workers are all running the same plan, so they'll all
make the same decision about which node needs to be executed next. If
that node can't accommodate multiple processes trying to execute it at
the same time, it will have to block all of them but the first one.
Adding more processes just increases the number of processes sitting
around doing nothing.

Doesn't this actually mean that we need a normal job scheduler, which is
given a queue of jobs and, with a pool of threads, can organize efficient
execution of queries? The optimizer can build a pipeline (graph) of tasks
corresponding to the execution plan nodes, i.e. SeqScan, Sort, ... Each
task is split into several jobs which can be concurrently scheduled by a
task dispatcher. That way no worker is ever blocked waiting for
something, and all system resources are utilized. Such a dispatcher-based
approach also allows us to implement quotas, priorities, ... and the
dispatcher can take care of NUMA and cache optimizations, which is
especially critical on modern architectures. One more reference:
http://db.in.tum.de/~leis/papers/morsels.pdf

Sorry, maybe I'm wrong, but I still think that async ops are
"multitasking for the poor" :)
Yes, maintaining threads and especially separate processes adds
significant overhead. It leads to extra resource consumption, and context
switches are quite expensive. And I know from my own experience that
replacing several concurrent processes performing some IO (for example
with sockets) with just one process using multiplexing can increase
performance. But async ops are still just a way to make the programmer
responsible for managing a state machine instead of relying on the OS to
make context switches. A manual transmission is still more efficient than
an automatic one, yet most drivers prefer the latter ;)

Seriously, I carefully read your response to Kouhei, but I'm still not
convinced that async ops are what we need. Or maybe we just understand
different things by this notion.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#17Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Kouhei Kaigai (#7)
Re: asynchronous and vectorized execution

On 5/10/16 12:47 AM, Kouhei Kaigai wrote:

On 10 May 2016 at 13:38, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

My concern about ExecProcNode is that it is constructed with a large
switch ... case statement. It involves tons of comparison operations at
run-time. If we replace this switch ... case with function pointers, it
would probably improve performance, especially for OLAP workloads that
process a large number of rows.

I imagined that any decent compiler would have built the code to use
jump tables for this. I have to say that I've never checked to make
sure though.

Ah, indeed, you are right. Please forget the above part.

Even so, I would think that the simplification in the executor would be
worth it. If you need to add a new node there are dozens of places where
you might have to mess with these giant case statements.

In python (for example), types (equivalent to nodes in this case) have
data structures that define function pointers for a core set of
operations (such as doing garbage collection, or generating a string
representation). That means that to add a new type at the C level you
only need to define a C structure that has the expected members, and an
initializer function that will properly set everything up when you
create a new instance. There's no messing around with the rest of the
guts of python.

*Even more important, everything you need to know about the type is
contained in one place, not spread throughout the code.*
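
Translated to the executor, that might look something like the sketch
below; the struct, the methods field, and the wrapper functions are all
made up for illustration:

typedef struct NodeMethods
{
    TupleTableSlot *(*exec) (PlanState *node);    /* return next tuple */
    void            (*rescan) (PlanState *node);  /* reset for rescan */
    void            (*end) (PlanState *node);     /* release resources */
} NodeMethods;

/* each node type supplies one static table of function pointers... */
static const NodeMethods seqscan_methods = {
    ExecSeqScanWrapper, ExecReScanSeqScanWrapper, ExecEndSeqScanWrapper
};

/* ...and dispatch becomes an indirect call instead of a giant switch */
static inline TupleTableSlot *
ExecDispatch(PlanState *node)
{
    return node->methods->exec(node);
}

Adding a new node type then means supplying one new methods table and an
initializer, without touching any central switch statement.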
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461


#18Bert
biertie@gmail.com
In reply to: Konstantin Knizhnik (#16)
Re: asynchronous and vectorized execution

Hmm, the morsels paper looks really interesting at first sight.
Let's see if we can get a PoC working in PostgreSQL? :-)


--
Bert Desmet
0477/305361

#19Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#1)
Re: asynchronous and vectorized execution

Hi,

On 2016-05-09 13:33:55 -0400, Robert Haas wrote:

I think there are several different areas
where we should consider major upgrades to our executor. It's too
slow and it doesn't do everything we want it to do. The main things
on my mind are:

3) We use a lot of very cache-inefficient data structures.

Especially the pervasive use of linked lists in the executor is pretty
bad for performance. Every element is likely to incur cache misses,
every list element pretty much has its own cacheline (thereby reducing
the overall cache hit ratio), and they have a horrible allocation
overhead (both space and palloc runtime).

1. asynchronous execution, by which I mean the ability of a node to
somehow say that it will generate a tuple eventually, but is not yet
ready, so that the executor can go run some other part of the plan
tree while it waits. [...]. It is also a problem
for parallel query: in a parallel sequential scan, the next worker can
begin reading the next block even if the current block hasn't yet been
received from the OS. Whether or not this will be efficient is a
research question, but it can be done. However, imagine a parallel
scan of a btree index: we don't know what page to scan next until we
read the previous page and examine the next-pointer. In the meantime,
any worker that arrives at that scan node has no choice but to block.
It would be better if the scan node could instead say "hey, thanks for
coming but I'm really not ready to be on-CPU just at the moment" and
potentially allow the worker to go work in some other part of the
query tree. For that worker to actually find useful work to do
elsewhere, we'll probably need it to be the case either that the table
is partitioned or the original query will need to involve UNION ALL,
but those are not silly cases to worry about, particularly if we get
native partitioning in 9.7.

I have to admit I'm not that convinced about the speedups in the !fdw
case. There seem to be much easier avenues for performance
improvements.

2. vectorized execution, by which I mean the ability of a node to
return tuples in batches rather than one by one. Andres has opined
more than once that repeated trips through ExecProcNode defeat the
ability of the CPU to do branch prediction correctly, slowing the
whole system down, and that they also result in poor CPU cache
behavior, since we jump all over the place executing a little bit of
code from each node before moving onto the next rather than running
one bit of code first, and then another later.

FWIW, I've even hacked something up for a bunch of simple queries, and
the performance improvements were significant. Besides it only being a
weekend hack project, the big thing I got stuck on was considering how
to exactly determine when to batch and not to batch.

I'd personally say that the CPU pipeline defeating aspect is worse than
the effect of the cache/branch misses themselves. Today's CPUs are
heavily superscalar, and our instruction-per-cycle throughput shows
pretty clearly that we're not good at employing (micro-)instruction
parallelism. We're quite frequently at well below one instruction/cycle.

My proposal for how to do this is to make ExecProcNode function as a
backward-compatibility wrapper. For asynchronous execution, a node
might return a not-ready-yet indication, but if that node is called
via ExecProcNode, it means the caller isn't prepared to receive such
an indication, so ExecProcNode will just wait for the node to become
ready and then return the tuple. Similarly, for vectorized execution,
a node might return a bunch of tuples all at once. ExecProcNode will
extract the first one and return it to the caller, and subsequent
calls to ExecProcNode will iterate through the rest of the batch, only
calling the underlying node-specific function when the batch is
exhausted. In this way, code that doesn't know about the new stuff
can continue to work pretty much as it does today.

I agree that that generally is a reasonable way forward.
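
A similarly minimal sketch of the vectorized half, where the wrapper
drains a batch one tuple at a time (TupleBatch and the result /
result_index fields are hypothetical):

TupleTableSlot *
ExecProcNode(PlanState *node)
{
    TupleBatch *batch = (TupleBatch *) node->result;

    /* hand out any tuples left over from the previous batch */
    if (batch != NULL && node->result_index < batch->ntuples)
        return batch->slots[node->result_index++];

    /* batch exhausted: call the node-specific function for a new one */
    node->result = node->ExecProcNodeReal(node);
    node->result_index = 0;
    batch = (TupleBatch *) node->result;
    if (batch == NULL || batch->ntuples == 0)
        return NULL;               /* end of stream */
    return batch->slots[node->result_index++];
}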

Also, and I think
this is important, nodes don't need the permission of their parent
node to use these new capabilities. They can use them whenever they
wish, without worrying about whether the upper node is prepared to
deal with it. If not, ExecProcNode will paper over the problem. This
seems to me to be a good way to keep the code simple.

Maybe not permission, but for some cases it seems to be important to
hint to *not* prefetch a lot of rows. E.g. anti joins come to mind. Just
using batching with force seems likely to regress some queries quite
badly (e.g. an expensive join inside an EXISTS() which returns many
tuples).

For asynchronous execution, I have gone so far as to mock up a bit of
what this might look like. This shouldn't be taken very seriously at
this point, but I'm attaching a few very-much-WIP patches to show the
direction of my line of thinking. Basically, I propose to have
ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
by putting them into a new PlanState member called "result", which is
just a Node * so that we can support multiple types of results,
instead of returning them.

What different types of results are you envisioning?

By the way, one smaller executor project that I think we should also
look at has to do with this comment in nodeSeqScan.c:

static bool
SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
{
/*
* Note that unlike IndexScan, SeqScan never use keys in heap_beginscan
* (and this is very bad) - so, here we do not check are keys ok or not.
*/
return true;
}

Some quick prototyping by my colleague Dilip Kumar suggests that, in
fact, there are cases where pushing down keys into heap_beginscan()
could be significantly faster.
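
For the record, the scan-key machinery itself already exists; a
pushed-down integer-equality qual could be set up roughly like this
(the attribute number and constant are placeholders, and deciding when
the pushdown is safe is the hard part):

ScanKeyData key;
HeapScanDesc scan;

/* push "col = 42" down into the heap scan itself */
ScanKeyInit(&key,
            1,                     /* attribute number of col */
            BTEqualStrategyNumber, /* equality strategy */
            F_INT4EQ,              /* int4 equality function */
            Int32GetDatum(42));

scan = heap_beginscan(rel, snapshot, 1, &key);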

I can immediately believe that.

Some care is required here because any
functions we execute as scan keys are run with the buffer locked, so
we had better not run anything very complicated. But doing this for
simple things like integer equality operators seems like it could save
quite a few buffer lock/unlock cycles and some other executor overhead
as well.

Hm. Do we really have to keep the page locked in the page-at-a-time
mode? Shouldn't the pin suffice?

Greetings,

Andres Freund


#20Ants Aasma
ants.aasma@eesti.ee
In reply to: Robert Haas (#1)
Re: asynchronous and vectorized execution

On Tue, May 10, 2016 at 7:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, May 9, 2016 at 8:34 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
I don't have any at the moment, but I'm not keen on hundreds of new
vector functions that can all have bugs or behavior differences versus
the unvectorized versions of the same code. That's a substantial tax
on future development. I think it's important to understand what
sorts of queries we are targeting here. KaiGai's GPU-acceleration
stuff does great on queries with complex WHERE clauses, but most
people don't care not only because it's out-of-core but because who
actually looks for the records where (a + b) % c > (d + e) * f / g?
This seems like it has the same issue. If we can speed up common
queries people are actually likely to run, OK, that's interesting.

I have seen pretty complex expressions in projections and
aggregation: a couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
END) type expressions. In critical places I had to replace them with
a C-coded function that processed a row at a time to avoid the
executor dispatch overhead.

By the way, I think KaiGai's GPU-acceleration stuff points to another
pitfall here. There's other stuff somebody might legitimately want to
do that requires another copy of each function. For example, run-time
code generation likely needs that (a function to tell the code
generator what to generate for each of our functions), and
GPU-acceleration probably does, too. If fixing a bug in numeric_lt
requires changing not only the regular version and the vectorized
version but also the GPU-accelerated version and the codegen version,
Tom and Dean are going to kill us. And justifiably so! Granted,
nobody is proposing those other features in core right now, but
they're totally reasonable things to want to do.

My thoughts in this area have been circling around getting LLVM to do
the heavy lifting. LLVM/clang could compile existing C functions to IR
and bundle those with the DB. At query planning time, or maybe even
during execution, the functions can be inlined into the compiled query
plan; LLVM can then be coaxed to copy propagate, constant fold and
dead code eliminate the bejeezus out of the expression tree. This way
duplication of the specialized code can be kept to a minimum while at
least the common cases can completely avoid the fmgr overhead.

This approach would also mesh together fine with batching. Given
suitably regular data structures and simple functions, LLVM will be
able to vectorize the code. If not it will still end up with a nice
tight loop that is an order of magnitude or two faster than the
current executor.
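
To illustrate, once int4eq is inlined and the fmgr glue is
constant-folded away, the residual code for a qual like "x = 42" over a
column of values could collapse to something as simple as this (purely
illustrative):

/* specialized inner loop; all function-call indirection compiled away */
int     nmatches = 0;
int     i;

for (i = 0; i < n; i++)
{
    if (!isnull[i] && values[i] == 42)
        matches[nmatches++] = i;
}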

The first cut could take care of ExecQual, ExecTargetList and friends.
Later improvements could let execution nodes provide basic blocks that
would then be threaded together into the main execution loop. If some
node does not implement the basic block interface a default
implementation is used that calls the current interface. It gets a bit
handwavy at this point, but the main idea would be to enable data
marshaling so that values can be routed directly to the code that
needs them without being written to intermediate storage.

I suspect the number of queries that are being hurt by fmgr overhead
is really large, and I think it would be nice to attack that problem
more directly. It's a bit hard to discuss what's worthwhile in the
abstract, without performance numbers, but when you vectorize, how
much is the benefit from using SIMD instructions and how much is the
benefit from just not going through the fmgr every time?

My feeling is the same: fmgr overhead, data marshaling, and dynamic
dispatch through the executor are the big issue. This is corroborated
by what other VM implementations have found. Once you get
the data into a uniform format where vectorized execution can be
used, CPU execution resources are no longer the bottleneck. Memory
bandwidth gets in the way, unless each input value is used in multiple
calculations. And even then, we are looking at a 4x speedup at best.

Regards,
Ants Aasma


#21Andres Freund
andres@anarazel.de
In reply to: David Rowley (#3)
Re: asynchronous and vectorized execution

On 2016-05-10 12:34:19 +1200, David Rowley wrote:

a. Modify ScanAPI to allow batch tuple fetching in predefined batch sizes.
b. Modify TupleTableSlot to allow > 1 tuple to be stored. Add flag to
indicate if the struct contains a single or a multiple tuples.
Multiple tuples may need to be deformed in a non-lazy fashion in order
to prevent too many buffers from having to be pinned at once. Tuples
will be deformed into arrays of each column rather than arrays for
each tuple (this part is important to support the next sub-project)

FWIW, I don't think that's necessarily required, and it has the
potential to slow down some operations (like target list
processing/projections) considerably. By the time vectorized execution
for postgres is ready, gather instructions should be common and fast
enough (IIRC they started to be OK with Broadwell, and are better in
Skylake; other archs have had them for longer).
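
For reference, the column-array layout from (b) might look something
like the sketch below; none of these structs exist today:

typedef struct TupleVector
{
    int      ntuples;   /* tuples currently stored in the batch */
    int      natts;     /* attributes per tuple */
    Datum  **values;    /* natts arrays of ntuples Datums each */
    bool   **isnull;    /* parallel per-column null-flag arrays */
} TupleVector;

A qual or aggregate can then walk one column contiguously
(vec->values[attno - 1]) instead of hopping from tuple to tuple.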

c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
process a batch TupleTableSlot. This will require some tight loop to
aggregate the entire TupleTableSlot at once before returning.
d. Add function in execAmi.c which returns true or false depending on
if the node supports batch TupleTableSlots or not.
e. At executor startup determine if the entire plan tree supports
batch TupleTableSlots, if so enable batch scan mode.

It doesn't really need to be the entire tree. Even if you have a subtree
(say a parametrized index nested loop join) which doesn't support batch
mode, you'll likely still see performance benefits by building a batch
one layer above the non-batch-supporting node.

Greetings,

Andres Freund


#22Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#13)
Re: asynchronous and vectorized execution

On 2016-05-10 12:56:17 -0400, Robert Haas wrote:

I suspect the number of queries that are being hurt by fmgr overhead
is really large, and I think it would be nice to attack that problem
more directly. It's a bit hard to discuss what's worthwhile in the
abstract, without performance numbers, but when you vectorize, how
much is the benefit from using SIMD instructions and how much is the
benefit from just not going through the fmgr every time?

I think fmgr overhead is an issue, but in most profiles of execution
heavy loads I've seen the bottlenecks are elsewhere. They often seem to
roughly look like
+   15.47%  postgres  postgres           [.] slot_deform_tuple
+   12.99%  postgres  postgres           [.] slot_getattr
+   10.36%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+    9.76%  postgres  postgres           [.] heap_getnext
+    6.34%  postgres  postgres           [.] HeapTupleSatisfiesMVCC
+    5.09%  postgres  postgres           [.] heapgetpage
+    4.59%  postgres  postgres           [.] hash_search_with_hash_value
+    4.36%  postgres  postgres           [.] ExecQual
+    3.30%  postgres  postgres           [.] ExecStoreTuple
+    3.29%  postgres  postgres           [.] ExecScan

or

-   33.67%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
   - ExecMakeFunctionResultNoSets
      + 99.11% ExecEvalOr
      + 0.89% ExecQual
+   14.32%  postgres  postgres           [.] slot_getattr
+    5.66%  postgres  postgres           [.] ExecEvalOr
+    5.06%  postgres  postgres           [.] check_stack_depth
+    5.02%  postgres  postgres           [.] slot_deform_tuple
+    4.05%  postgres  postgres           [.] pgstat_end_function_usage
+    3.69%  postgres  postgres           [.] heap_getnext
+    3.41%  postgres  postgres           [.] ExecEvalScalarVarFast
+    3.36%  postgres  postgres           [.] ExecEvalConst

with a healthy dose of _bt_compare, heap_hot_search_buffer in more index
heavy workloads.

(Yes, I just pulled these example profiles from somewhere, but I've more
often seen them look like this than very fmgr-heavy.)

That seems to suggest that we need to restructure how we get to calling
fmgr functions, before worrying about the actual fmgr call.

Tomas, Mark, IIRC you'd both generated perf profiles for TPC-H (IIRC?)
queries at some point. Any chance the results are online somewhere?

Greetings,

Andres Freund


#23Andres Freund
andres@anarazel.de
In reply to: Ants Aasma (#20)
Re: asynchronous and vectorized execution

On 2016-05-11 03:20:12 +0300, Ants Aasma wrote:


I have seen pretty complex expressions in projections and
aggregation: a couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
END) type expressions. In critical places I had to replace them with
a C-coded function that processed a row at a time to avoid the
executor dispatch overhead.

I've seen that as well, but was it the actual fmgr indirection causing
the overhead, or was it ExecQual/ExecMakeFunctionResultNoSets et al?

Greetings,

Andres Freund


#24Ants Aasma
ants.aasma@eesti.ee
In reply to: Robert Haas (#1)
Re: asynchronous and vectorized execution

On Wed, May 11, 2016 at 3:52 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-05-11 03:20:12 +0300, Ants Aasma wrote:


I have seen pretty complex expressions in projections and
aggregation: a couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f
END) type expressions. In critical places I had to replace them with
a C-coded function that processed a row at a time to avoid the
executor dispatch overhead.

I've seen that as well, but was it the actual fmgr indirection causing
the overhead, or was it ExecQual/ExecMakeFunctionResultNoSets et al?

I don't remember what the exact profile looked like, but IIRC it was
mostly Exec* stuff with advance_aggregates also up there.

Regards,
Ants Aasma


#25Robert Haas
robertmhaas@gmail.com
In reply to: Jim Nasby (#17)
Re: asynchronous and vectorized execution

On Tue, May 10, 2016 at 4:57 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

Even so, I would think that the simplification in the executor would be
worth it. If you need to add a new node there are dozens of places where
you might have to mess with these giant case statements.

Dozens? I think the number is in the single digits.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#26Robert Haas
robertmhaas@gmail.com
In reply to: Konstantin Knizhnik (#16)
Re: asynchronous and vectorized execution

On Tue, May 10, 2016 at 3:42 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Doesn't this actually mean that we need a normal job scheduler, which is
given a queue of jobs and, with a pool of threads, can organize efficient
execution of queries? The optimizer can build a pipeline (graph) of tasks
corresponding to the execution plan nodes, i.e. SeqScan, Sort, ... Each
task is split into several jobs which can be concurrently scheduled by a
task dispatcher. That way no worker is ever blocked waiting for
something, and all system resources are utilized. Such a dispatcher-based
approach also allows us to implement quotas, priorities, ... and the
dispatcher can take care of NUMA and cache optimizations, which is
especially critical on modern architectures. One more reference:
http://db.in.tum.de/~leis/papers/morsels.pdf

I read this as a proposal to redesign the entire optimizer and
executor to use some new kind of plan. That's not a project I'm
willing to entertain; it is hard to imagine we could do it in a
reasonable period of time without introducing bugs and performance
regressions. I think there is a great deal of performance benefit
that we can get by changing things incrementally.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#27Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#19)
Re: asynchronous and vectorized execution

On Tue, May 10, 2016 at 7:57 PM, Andres Freund <andres@anarazel.de> wrote:

1. asynchronous execution, by which I mean the ability of a node to
somehow say that it will generate a tuple eventually, but is not yet
ready, so that the executor can go run some other part of the plan
tree while it waits. [...]. It is also a problem
for parallel query: in a parallel sequential scan, the next worker can
begin reading the next block even if the current block hasn't yet been
received from the OS. Whether or not this will be efficient is a
research question, but it can be done. However, imagine a parallel
scan of a btree index: we don't know what page to scan next until we
read the previous page and examine the next-pointer. In the meantime,
any worker that arrives at that scan node has no choice but to block.
It would be better if the scan node could instead say "hey, thanks for
coming but I'm really not ready to be on-CPU just at the moment" and
potentially allow the worker to go work in some other part of the
query tree. For that worker to actually find useful work to do
elsewhere, we'll probably need it to be the case either that the table
is partitioned or the original query will need to involve UNION ALL,
but those are not silly cases to worry about, particularly if we get
native partitioning in 9.7.

I have to admit I'm not that convinced about the speedups in the !fdw
case. There seem to be much easier avenues for performance
improvements.

What I'm talking about is a query like this:

SELECT * FROM inheritance_tree_of_foreign_tables WHERE very_rarely_true;

What we do today is run the remote query on the first child table to
completion, then start it on the second child table, and so on.
Sending all the queries at once can bring a speed-up of a factor of N
to a query with N children, and it's completely independent of every
other speed-up that we might attempt. This has been under discussion
for years on FDW-related threads as a huge problem that we need to fix
someday, and I really don't see how it's sane not to try. The shape
of what that looks like is of course arguable, but saying the
optimization isn't valuable blows my mind.

Whether you care about this case or not, this is also important for
parallel query.

FWIW, I've even hacked something up for a bunch of simple queries, and
the performance improvements were significant. Besides it only being a
weekend hack project, the big thing I got stuck on was considering how
to exactly determine when to batch and not to batch.

Yeah. I think we need a system for signalling nodes as to when they
will be run to completion. But a Boolean is somehow unsatisfying;
LIMIT 1000000 is more like no LIMIT than it is like LIMIT 1. I'm
tempted to add a numTuples field to every ExecutorState and give upper
nodes some way to set it, as a hint.
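
As a sketch, that hint could be little more than a field that upper
nodes fill in before pulling from their children (the field name is
hypothetical):

/*
 * In EState: how many tuples the top of the plan is expected to pull;
 * 0 means "unknown, assume run to completion".
 */
int64       es_numTuplesHint;

/* e.g. ExecLimit could hint its subtree before the first fetch */
node->ps.state->es_numTuplesHint = node->offset + node->count;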

For asynchronous execution, I have gone so far as to mock up a bit of
what this might look like. This shouldn't be taken very seriously at
this point, but I'm attaching a few very-much-WIP patches to show the
direction of my line of thinking. Basically, I propose to have
ExecBlah (that is, ExecBitmapHeapScan, ExecAppend, etc.) return tuples
by putting them into a new PlanState member called "result", which is
just a Node * so that we can support multiple types of results,
instead of returning them.

What different types of results are you envisioning?

TupleTableSlots and TupleTableVectors, mostly. I think the stuff that
is currently going through MultiExecProcNode() could probably be
folded in as just another type of result.

Some care is required here because any
functions we execute as scan keys are run with the buffer locked, so
we had better not run anything very complicated. But doing this for
simple things like integer equality operators seems like it could save
quite a few buffer lock/unlock cycles and some other executor overhead
as well.

Hm. Do we really have to keep the page locked in the page-at-a-time
mode? Shouldn't the pin suffice?

I think we need a lock to examine MVCC visibility information. A pin
is enough to prevent a tuple from being removed, but not from having
its xmax and cmax overwritten at almost but not quite exactly the same
time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#28Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Robert Haas (#26)
Re: asynchronous and vectorized execution

On 11.05.2016 17:00, Robert Haas wrote:

On Tue, May 10, 2016 at 3:42 PM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Doesn't this actually mean that we need a normal job scheduler, which is
given a queue of jobs and, with a pool of threads, can organize efficient
execution of queries? The optimizer can build a pipeline (graph) of tasks
corresponding to the execution plan nodes, i.e. SeqScan, Sort, ... Each
task is split into several jobs which can be concurrently scheduled by a
task dispatcher. That way no worker is ever blocked waiting for
something, and all system resources are utilized. Such a dispatcher-based
approach also allows us to implement quotas, priorities, ... and the
dispatcher can take care of NUMA and cache optimizations, which is
especially critical on modern architectures. One more reference:
http://db.in.tum.de/~leis/papers/morsels.pdf

I read this as a proposal to redesign the entire optimizer and
executor to use some new kind of plan. That's not a project I'm
willing to entertain; it is hard to imagine we could do it in a
reasonable period of time without introducing bugs and performance
regressions. I think there is a great deal of performance benefit
that we can get by changing things incrementally.

Yes, I agree with you that completely rewriting the optimizer is a huge
project with unpredictable influence on the performance of some queries.
Changing things incrementally is a good approach, but only if we are
moving in the right direction.
I am still not sure that the introduction of async operations is a step
in the right direction. Async ops tend to significantly complicate code
(since you have to maintain state yourself). It will be bad if the
implementation of each node has to deal with async state itself in its
own manner.

My suggestion is to try to provide some generic mechanism for managing
state transitions, and to have a scheduler which controls this process.
It should not be the responsibility of the node implementation to
organize asynchronous/parallel execution. Instead, it should just produce
a set of jobs whose execution is controlled by the scheduler. The first
implementation of the scheduler can be quite simple, but later it can
become more clever: try to bind data to processors and do many other
optimizations.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#29Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#21)
Re: asynchronous and vectorized execution

On Tue, May 10, 2016 at 8:23 PM, Andres Freund <andres@anarazel.de> wrote:

c. Modify some nodes (perhaps start with nodeAgg.c) to allow them to
process a batch TupleTableSlot. This will require some tight loop to
aggregate the entire TupleTableSlot at once before returning.
d. Add function in execAmi.c which returns true or false depending on
if the node supports batch TupleTableSlots or not.
e. At executor startup determine if the entire plan tree supports
batch TupleTableSlots, if so enable batch scan mode.

It doesn't really need to be the entire tree. Even if you have a subtree
(say a parametrized index nested loop join) which doesn't support batch
mode, you'll likely still see performance benefits by building a batch
one layer above the non-batch-supporting node.

+1.

I've also wondered about building a new executor node that is sort of
a combination of Nested Loop and Hash Join, but capable of performing
multiple joins in a single operation. (Merge Join is different,
because it's actually matching up the two sides, not just probing
once per outer tuple.) So the plan tree would look something
like this:

Multiway Join
-> Seq Scan on driving_table
-> Index Scan on something
-> Index Scan on something_else
-> Hash
-> Seq Scan on other_thing
-> Hash
-> Seq Scan on other_thing_2
-> Index Scan on another_one

With the current structure, every level of the plan tree has its own
TupleTableSlot and we have to project into each new slot. Every level
has to go through ExecProcNode. So it seems to me that this sort of
structure might save quite a few cycles on deep join nests. I haven't
tried it, though.

With batching, things get even better for this sort of thing.
Assuming the joins are all basically semi-joins, either because they
were written that way or because they are probing unique indexes or
whatever, you can fetch a batch of tuples from the driving table, do
the first join for each tuple to create a matching batch of tuples,
and repeat for each join step. Then at the end you project.
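
A sketch of that inner loop, assuming hash-table probes with semi-join
semantics (hash_probe, jointables and the batch layout are all
hypothetical):

/* filter one batch of driving-table tuples through each join in turn */
for (j = 0; j < njoins; j++)
{
    int     surviving = 0;

    for (i = 0; i < batch->ntuples; i++)
    {
        /* semi-join: a tuple survives on its first match */
        if (hash_probe(jointables[j], batch->tuples[i]) != NULL)
            batch->tuples[surviving++] = batch->tuples[i];
    }
    batch->ntuples = surviving;    /* shrink the batch to the survivors */
}
/* only now project the tuples that survived every join step */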

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#30Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Robert Haas (#15)
Re: asynchronous and vectorized execution

On 10.05.2016 20:26, Robert Haas wrote:

At this moment (February) they have implemented translation of only a few
PostgreSQL operators used by ExecQual and do not support aggregates.
They get about a 2x speed increase on synthetic queries and a 25%
increase on TPC-H Q1 (for Q1 the most critical part is generating native
code for aggregates, because ExecQual itself takes only 6% of the time
for this query). Actually these 25% for Q1 were achieved not by dynamic
code generation, but by switching from a PULL to a PUSH model in the
executor. That seems to be yet another interesting PostgreSQL executor
transformation. As far as I know, they are going to publish the result
of their work as open source...
Interesting. You may notice that in "asynchronous mode" my prototype
works using a push model of sorts. Maybe that should be taken
further.

Latest information from the ISP RAS guys: they have made good progress
since February. They have rewritten most of the methods of Scan,
Aggregate and Join using the LLVM API, and they have also implemented
automatic translation of PostgreSQL backend functions to the LLVM API.
As a result, the execution time of TPC-H Q1 is reduced four times.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#31Robert Haas
robertmhaas@gmail.com
In reply to: Konstantin Knizhnik (#28)
Re: asynchronous and vectorized execution

On Wed, May 11, 2016 at 10:17 AM, Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:

Yes, I agree with you that completely rewriting the optimizer is a huge
project with unpredictable influence on the performance of some queries.
Changing things incrementally is a good approach, but only if we are
moving in the right direction.
I am still not sure that the introduction of async operations is a step
in the right direction. Async ops tend to significantly complicate code
(since you have to maintain state yourself). It will be bad if the
implementation of each node has to deal with async state itself in its
own manner.

I don't really think so. The design I've proposed makes adding
asynchronous capability to a node pretty easy, with only minor
changes.

My suggestion is to try to provide some generic mechanism for managing
state transitions, and to have a scheduler which controls this process.
It should not be the responsibility of the node implementation to
organize asynchronous/parallel execution. Instead, it should just produce
a set of jobs whose execution is controlled by the scheduler. The first
implementation of the scheduler can be quite simple, but later it can
become more clever: try to bind data to processors and do many other
optimizations.

Whereas this would require a massive rewrite.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#32Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#22)
Re: asynchronous and vectorized execution

On Tue, May 10, 2016 at 8:50 PM, Andres Freund <andres@anarazel.de> wrote:

That seems to suggest that we need to restructure how we get to calling
fmgr functions, before worrying about the actual fmgr call.

Any ideas on how to do that? ExecMakeFunctionResultNoSets() isn't
really doing a heck of a lot. Changing FuncExprState to use an array
rather than a linked list to store its arguments might help some. We
could also consider having an optimized path that skips the fn_strict
stuff if we can somehow deduce that no NULLs can occur in this
context, but that's a lot of work and new infrastructure. I feel like
maybe there's something higher-level we could do that would help more,
but I don't know what it is.
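
A sketch of the array variant (today FuncExprState keeps its arguments
in a List, so the nargs/args fields here are hypothetical):

typedef struct FuncExprState
{
    ExprState    xprstate;
    int          nargs;    /* number of arguments */
    ExprState  **args;     /* flat array instead of a List */
    FmgrInfo     func;     /* lookup info for the target function */
} FuncExprState;

/* argument evaluation becomes a counted, cache-friendly loop */
for (i = 0; i < fcache->nargs; i++)
    fcinfo->arg[i] = ExecEvalExpr(fcache->args[i], econtext,
                                  &fcinfo->argnull[i], NULL);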

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#33Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#27)
Re: asynchronous and vectorized execution

On 2016-05-11 10:12:26 -0400, Robert Haas wrote:

I have to admit I'm not that convinced about the speedups in the !fdw
case. There seem to be much easier avenues for performance
improvements.

What I'm talking about is a query like this:

SELECT * FROM inheritance_tree_of_foreign_tables WHERE very_rarely_true;

Note that I said "!fdw case".

FWIW, I've even hacked something up for a bunch of simple queries, and
the performance improvements were significant. Besides it only being a
weekend hack project, the big thing I got stuck on was considering how
to exactly determine when to batch and not to batch.

Yeah. I think we need a system for signalling nodes as to when they
will be run to completion. But a Boolean is somehow unsatisfying;
LIMIT 1000000 is more like no LIMIT than it is like LIMIT 1. I'm
tempted to add a numTuples field to every ExecutorState and give upper
nodes some way to set it, as a hint.

I was wondering whether we should hand down TupleVectorStates to lower
nodes, and their size determines the max batch size...

Some care is required here because any
functions we execute as scan keys are run with the buffer locked, so
we had better not run anything very complicated. But doing this for
simple things like integer equality operators seems like it could save
quite a few buffer lock/unlock cycles and some other executor overhead
as well.

Hm. Do we really have to keep the page locked in the page-at-a-time
mode? Shouldn't the pin suffice?

I think we need a lock to examine MVCC visibility information. A pin
is enough to prevent a tuple from being removed, but not from having
its xmax and cmax overwritten at almost but not quite exactly the same
time.

We already batch visibility lookups in page-at-a-time
mode. Cf. heapgetpage() / scan->rs_vistuples. So we can evaluate quals
after releasing the lock, but before the pin is released, without that
much effort. IIRC that isn't used for index lookups, but that's
probably a good idea.

Greetings,

Andres Freund


#34Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#32)
Re: asynchronous and vectorized execution

On 2016-05-11 10:32:20 -0400, Robert Haas wrote:

On Tue, May 10, 2016 at 8:50 PM, Andres Freund <andres@anarazel.de> wrote:

That seems to suggest that we need to restructure how we get to calling
fmgr functions, before worrying about the actual fmgr call.

Any ideas on how to do that? ExecMakeFunctionResultNoSets() isn't
really doing a heck of a lot. Changing FuncExprState to use an array
rather than a linked list to store its arguments might help some. We
could also consider having an optimized path that skips the fn_strict
stuff if we can somehow deduce that no NULLs can occur in this
context, but that's a lot of work and new infrastructure. I feel like
maybe there's something higher-level we could do that would help more,
but I don't know what it is.

I think it's not just ExecMakeFunctionResultNoSets, it's the whole
call-stack which needs to be optimized together.

E.g. look at a few performance metrics for a simple seqscan query with a
bunch of ORed equality constraints:
SELECT count(*) FROM pgbench_accounts WHERE abalance = -1 OR abalance = -2 OR abalance = -3 OR abalance = -4 OR abalance = -5 OR abalance = -6 OR abalance = -7 OR abalance = -8 OR abalance = -9 OR abalance = -10;

perf record -g -p 27286 -F 5000 -e cycles:ppp,branch-misses,L1-icache-load-misses,iTLB-load-misses,L1-dcache-load-misses,dTLB-load-misses,LLC-load-misses sleep 3
6K cycles:ppp
6K branch-misses
1K L1-icache-load-misses
472 iTLB-load-misses
5K L1-dcache-load-misses
6K dTLB-load-misses
6K LLC-load-misses

You can see that a number of events sample at a high rate, especially
when you take the cycle samples into account.

cycles:
+   32.35%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+   14.51%  postgres  postgres           [.] slot_getattr
+    5.50%  postgres  postgres           [.] ExecEvalOr
+    5.22%  postgres  postgres           [.] check_stack_depth
branch-misses:
+   73.77%  postgres  postgres           [.] ExecQual
+   17.83%  postgres  postgres           [.] ExecEvalOr
+    1.49%  postgres  postgres           [.] heap_getnext
L1-icache-load-misses:
+    4.71%  postgres  [kernel.kallsyms]  [k] update_curr
+    4.37%  postgres  postgres           [.] hash_search_with_hash_value
+    3.91%  postgres  postgres           [.] heap_getnext
+    3.81%  postgres  [kernel.kallsyms]  [k] task_tick_fair
iTLB-load-misses:
+   27.57%  postgres  postgres           [.] LWLockAcquire
+   18.32%  postgres  postgres           [.] hash_search_with_hash_value
+    7.09%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+    3.06%  postgres  postgres           [.] ExecEvalConst
L1-dcache-load-misses:
+   20.35%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+   12.31%  postgres  postgres           [.] check_stack_depth
+    8.84%  postgres  postgres           [.] heap_getnext
+    8.00%  postgres  postgres           [.] slot_deform_tuple
+    7.15%  postgres  postgres           [.] HeapTupleSatisfiesMVCC
dTLB-load-misses:
+   50.13%  postgres  postgres           [.] ExecQual
+   41.36%  postgres  postgres           [.] ExecEvalOr
+    2.96%  postgres  postgres           [.] hash_search_with_hash_value
+    1.30%  postgres  postgres           [.] PinBuffer.isra.3
+    1.19%  postgres  postgres           [.] heap_page_prune_op
LLC-load-misses:
+   24.25%  postgres  postgres           [.] slot_deform_tuple
+   17.45%  postgres  postgres           [.] CheckForSerializableConflictOut
+   10.52%  postgres  postgres           [.] heapgetpage
+    9.55%  postgres  postgres           [.] HeapTupleSatisfiesMVCC
+    7.52%  postgres  postgres           [.] ExecMakeFunctionResultNoSets

For this workload, we expect a lot of LLC-load-misses, as the workload is
a lot bigger than memory, and it makes sense that they're in
slot_deform_tuple(), heapgetpage(), HeapTupleSatisfiesMVCC() (but uh,
CheckForSerializableConflictOut?). One avenue to optimize is to make
those accesses easier to predict/prefetch, which at the moment they
likely are not.

But leaving that aside, we can see that a lot of the cost is distributed
over ExecQual, ExecEvalOr, ExecMakeFunctionResultNoSets - all of which
make liberal use of linked lists. I suspect that by simplifying these
functions / data structures *AND* by calling them over a batch of tuples
instead of one-by-one, we'd limit the time spent in them considerably.
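
i.e., very roughly the following shape, leaving the per-tuple
evaluation itself untouched for the moment (ExecQualBatch is
hypothetical):

/* evaluate one qual over a whole batch of tuples in a single call */
static int
ExecQualBatch(List *qual, ExprContext *econtext,
              TupleTableSlot **slots, int ntuples, bool *passed)
{
    int     i;
    int     npassed = 0;

    for (i = 0; i < ntuples; i++)
    {
        econtext->ecxt_scantuple = slots[i];
        passed[i] = ExecQual(qual, econtext, false);
        if (passed[i])
            npassed++;
    }
    return npassed;
}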

Greetings,

Andres Freund


#35Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#33)
Re: asynchronous and vectorized execution

On Wed, May 11, 2016 at 11:49 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-05-11 10:12:26 -0400, Robert Haas wrote:

I have to admit I'm not that convinced about the speedups in the !fdw
case. There seem to be much easier avenues for performance
improvements.

What I'm talking about is a query like this:

SELECT * FROM inheritance_tree_of_foreign_tables WHERE very_rarely_true;

Note that I said "!fdw case".

Oh, wow, I totally missed that exclamation point.

FWIW, I've even hacked something up for a bunch of simple queries, and
the performance improvements were significant. Besides it only being a
weekend hack project, the big thing I got stuck on was considering how
to exactly determine when to batch and not to batch.

Yeah. I think we need a system for signalling nodes as to when they
will be run to completion. But a Boolean is somehow unsatisfying;
LIMIT 1000000 is more like no LIMIT than it is like LIMIT 1. I'm
tempted to add a numTuples field to every ExecutorState and give upper
nodes some way to set it, as a hint.

I was wondering whether we should hand down TupleVectorStates to lower
nodes, and their size determines the max batch size...

There's some appeal to that, but it seems complicated to make work.

Some care is required here because any
functions we execute as scan keys are run with the buffer locked, so
we had better not run anything very complicated. But doing this for
simple things like integer equality operators seems like it could save
quite a few buffer lock/unlock cycles and some other executor overhead
as well.

Hm. Do we really have to keep the page locked in the page-at-a-time
mode? Shouldn't the pin suffice?

I think we need a lock to examine MVCC visibility information. A pin
is enough to prevent a tuple from being removed, but not from having
its xmax and cmax overwritten at almost but not quite exactly the same
time.

We already batch visibility lookups in page-at-a-time
mode. Cf. heapgetpage() / scan->rs_vistuples. So we can evaluate quals
after releasing the lock, but before the pin is released, without that
much effort. IIRC that isn't used for index lookups, but that's
probably a good idea.

The trouble with that is that if you fail the qual, you have to relock
the page. Which kinda sucks, if the qual is really simple.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#36Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#35)
Re: asynchronous and vectorized execution

On 2016-05-11 12:27:55 -0400, Robert Haas wrote:

On Wed, May 11, 2016 at 11:49 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-05-11 10:12:26 -0400, Robert Haas wrote:

Hm. Do we really have to keep the page locked in the page-at-a-time
mode? Shouldn't the pin suffice?

I think we need a lock to examine MVCC visibility information. A pin
is enough to prevent a tuple from being removed, but not from having
its xmax and cmax overwritten at almost but not quite exactly the same
time.

We already batch visibility lookups in page-at-a-time
mode. Cf. heapgetpage() / scan->rs_vistuples. So we can evaluate quals
after releasing the lock, but before the pin is released, without that
much effort. IIRC that isn't used for index lookups, but that's
probably a good idea.

The trouble with that is that if you fail the qual, you have to relock
the page. Which kinda sucks, if the qual is really simple.

Hm? Am I missing something here? We currently do the visibility checks in
bulk for the whole page. After that we release the page lock. What
prevents us from executing the quals directly after that? And why would
you need to relock the page?


#37Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#36)
Re: asynchronous and vectorized execution

On Wed, May 11, 2016 at 12:30 PM, Andres Freund <andres@anarazel.de> wrote:

On 2016-05-11 12:27:55 -0400, Robert Haas wrote:

On Wed, May 11, 2016 at 11:49 AM, Andres Freund <andres@anarazel.de> wrote:

On 2016-05-11 10:12:26 -0400, Robert Haas wrote:

Hm. Do we really have to keep the page locked in the page-at-a-time
mode? Shouldn't the pin suffice?

I think we need a lock to examine MVCC visibility information. A pin
is enough to prevent a tuple from being removed, but not from having
its xmax and cmax overwritten at almost but not quite exactly the same
time.

We already batch visibility lookups in page-at-a-time
mode. Cf. heapgetpage() / scan->rs_vistuples. So we can evaluate quals
after releasing the lock, but before the pin is released, without that
much effort. IIRC that isn't used for index lookups, but that's
probably a good idea.

The trouble with that is that if you fail the qual, you have to relock
the page. Which kinda sucks, if the qual is really simple.

Hm? Am I missing something here? We currently do the visibility checks in
bulk for the whole page. After that we release the page lock. What
prevents us from executing the quals directly after that? And why would
you need to relock the page?

Oh, yeah, in page-at-a-time mode we can release the lock first. I was
thinking about what to do in tuple-at-a-time mode (i.e. when the page is
not all-visible).
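
To spell out the page-at-a-time flow under discussion: the first part is
roughly what heapgetpage() already does (error handling and the
serializable check omitted); the final step is the hypothetical
addition:

    LockBuffer(buffer, BUFFER_LOCK_SHARE);

    ntup = 0;
    for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum++)
    {
        /* ... point "tuple" at this line pointer ... */
        if (HeapTupleSatisfiesVisibility(&tuple, snapshot, buffer))
            scan->rs_vistuples[ntup++] = offnum;
    }

    LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    /*
     * Hypothetical addition: with visibility already decided and only
     * the pin still held, simple quals could be evaluated here, and
     * tuples failing them dropped before the executor proper ever
     * sees them.
     */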

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#38Andreas Seltenreich
seltenreich@gmx.de
In reply to: Konstantin Knizhnik (#30)
Just-in-time compiling things (was: asynchronous and vectorized execution)

Konstantin Knizhnik writes:

Latest information from the ISP RAS guys: they have made good progress
since February: they have rewritten most of the methods of Scan,
Aggregate and Join using the LLVM API.

Is their work available somewhere? I'm experimenting in that area as
well, although I'm using libFirm instead of LLVM. I wonder what their
motivation to rewrite backend code in LLVM IR was, since I am following
the approach of keeping the IR around when compiling the vanilla
postgres C code, possibly inlining it during JIT and then doing
optimizations on this IR. That way the logic doesn't have to be
duplicated.

regards
Andreas


#39Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Andreas Seltenreich (#38)
Re: Just-in-time compiling things

On 05/14/2016 12:10 PM, Andreas Seltenreich wrote:

Konstantin Knizhnik writes:

Latest information from the ISP RAS guys: they have made good progress
since February: they have rewritten most of the methods of Scan,
Aggregate and Join using the LLVM API.

Is their work available somewhere? I'm experimenting in that area as
well, although I'm using libFirm instead of LLVM. I wonder what their
motivation to rewrite backend code in LLVM IR was, since I am following
the approach of keeping the IR around when compiling the vanilla
postgres C code, possibly inlining it during JIT and then doing
optimizations on this IR. That way the logic doesn't have to be
duplicated.

The work is not yet completed, but it will definitely be put into open
source eventually. I am going to talk a little bit about this project at
PGCon in Ottawa during the lightning talks, although I do not know the
details of the project myself.

The main difference of their approach compared with Vitesse DB is that
they implement a way of automatically converting PostgreSQL operators to
LLVM IR. So instead of rewriting ~2000 operators manually (a lot of work
and errors), they implemented a converter which transforms the code of
these operators into ... C++ code producing LLVM IR. So they only need to
rewrite the plan nodes manually. They have already implemented most of
the nodes (SeqScan, Sort, HashJoin, ...), which allows all TPC-H queries
to be executed. Results will be published soon. The largest advantage is
definitely on Q1 - about 4 times. That is worse than Vitesse DB (8 times)
and than manually written operators (7 times). The most probable reason
for such a performance penalty is overflow checking: in manually written
LLVM code it can be done more efficiently, using the corresponding
assembler instruction, than in code automatically converted from standard
C. But the ISP RAS guys are going to fix this problem and improve the
quality of the automatic conversion.
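
To illustrate the overflow-checking point with ordinary C (a sketch, not
their code): __builtin_add_overflow is the GCC/Clang builtin that maps
to LLVM's llvm.sadd.with.overflow, i.e. one add plus a branch on the
CPU's overflow flag, whereas code mechanically translated from portable
C keeps the widening arithmetic and comparison:

    #include <stdbool.h>
    #include <stdint.h>

    /* Portable style: widen, add, compare - what translated C does. */
    static bool
    add_checked_portable(int32_t a, int32_t b, int32_t *sum)
    {
        int64_t     wide = (int64_t) a + (int64_t) b;

        if (wide != (int32_t) wide)
            return false;       /* overflow */
        *sum = (int32_t) wide;
        return true;
    }

    /* Intrinsic style: reuses the add instruction's own overflow flag. */
    static bool
    add_checked_intrinsic(int32_t a, int32_t b, int32_t *sum)
    {
        return !__builtin_add_overflow(a, b, sum);
    }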

I include in CC members of ISP RAS team - you can ask them questions directly.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


#40Oleg Bartunov
obartunov@gmail.com
In reply to: Andreas Seltenreich (#38)
Re: Just-in-time compiling things (was: asynchronous and vectorized execution)

On Sat, May 14, 2016 at 12:10 PM, Andreas Seltenreich
<seltenreich@gmx.de> wrote:

Konstantin Knizhnik writes:

Latest information from the ISP RAS guys: they have made good progress
since February: they have rewritten most of the methods of Scan,
Aggregate and Join using the LLVM API.

Is their work available somewhere? I'm experimenting in that area as
well, although I'm using libFirm instead of LLVM. I wonder what their
motivation to rewrite backend code in LLVM IR was, since I am following
the approach of keeping the IR around when compiling the vanilla
postgres C code, possibly inlining it during JIT and then doing
optimizations on this IR. That way the logic doesn't have to be
duplicated.

I have discussed the availability of their work, and the consensus was
that eventually their code will be open source, but not right now, since
it is not ready to be published. I'll meet with their management staff
(after PGCon) and discuss how we can work together.



#41Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#1)
Re: asynchronous and vectorized execution

We may also want to consider handling abstract events such as
"tuples-are-available-at-plan-node-X".

One benefit is that we can combine this with batch processing. For
example, in the case of an Append node containing foreign scans, its
parent node may not want to process the Append node's result until Append
is ready with at least 1000 rows. In that case, the Append node needs to
raise an "n-tuples-are-ready" event; we cannot just rely on fd-ready
events. An fd-ready event will wake up the foreign scan, but it may not
eventually cause its parent Append node to in turn wake up its parent.

The other benefit (I am not sure how significant it is) is the
"at-plan-node-X" part of the event. For example, for an Append node
having 10 foreign scans, when a foreign scan wakes up and becomes ready
with tuple(s), its parent node (i.e. Append) will be executed. But it
would speed things up if it knew which of its foreign scan nodes are
ready. From Robert's prototype, I can see that it can find that out by
checking the result_ready field of each foreign scan plan state. But if
it knows from the event that node X is the one that is ready, it can take
tuples directly from there. Another thing is, we may want to give the
Append node a chance to learn about all the nodes that are ready, instead
of just one node.

How we can do this event abstraction is the other question. We could have
one latch for each event, and each node would raise its own event by
setting the corresponding latch. But I am not sure about latches within a
single process, as opposed to one process waking up another process.
Otherwise, some internal event structures need to be present (in
estate?), which ExecProcNode would then use when it does the event
looping to wake up (i.e. execute) the required nodes.

Also, the WaitEvent.user_data field can carry some more info besides the
plan state. It can have the parent PlanState stored, so that we don't
have to have a parent field in the plan state. It can also carry some
more data, such as "n-tuples-available".



#42Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#41)
Re: asynchronous and vectorized execution

On Wed, Jun 29, 2016 at 11:00 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

We may also want to consider handling abstract events such as
"tuples-are-available-at-plan-node-X".

One benefit is that we can combine this with batch processing. For
example, in the case of an Append node containing foreign scans, its
parent node may not want to process the Append node's result until Append
is ready with at least 1000 rows. In that case, the Append node needs to
raise an "n-tuples-are-ready" event; we cannot just rely on fd-ready
events. An fd-ready event will wake up the foreign scan, but it may not
eventually cause its parent Append node to in turn wake up its parent.

Right, I agree. I think this case only arises in parallel query. In
serial execution, there's not really any way for a plan node to just
become ready other than an FD or latch event or the parent becoming
ready. But in parallel query it can happen, of course, because some
other backend can do work that makes that node ready to produce
tuples.

It's not necessarily the case that we have to deal with this in the
initial patches for this feature, because the most obvious win for
this sort of thing is when we have an Append of ForeignScan plans.
Sure, parallel query has interesting cases, too, but a prototype that
just handles Append over a bunch of postgres_fdw ForeignScans would be
pretty cool. I suggest that we make that the initial goal here.

How we can do this event abstraction is the other question. We could have
one latch for each event, and each node would raise its own event by
setting the corresponding latch. But I am not sure about latches within a
single process, as opposed to one process waking up another process.
Otherwise, some internal event structures need to be present (in
estate?), which ExecProcNode would then use when it does the event
looping to wake up (i.e. execute) the required nodes.

I think adding more latches would be a bad idea. What I think we
should do instead is add two additional data structures to dynamic
shared memory:

1. An array of PGPROC * pointers for all of the workers. Processes
add their PGPROC * to this array as they start up. Then, parallel.h
can expose a new API, ParallelWorkerSetLatchesForGroup(void). In the
leader, this sets the latch for every worker process for every
parallel context with which the leader is associated; in a worker, it
sets the latch for the other processes in the parallel group, and for
the leader as well.

2. An array of executor nodes where one process might do something
that allows other processes to make progress on that node. This would
be set up somehow by execParallel.c, which would need to somehow
figure out which plan nodes want to be included in the list. When an
executor node does something that might unblock other workers, it
calls ParallelWorkerSetLatchesForGroup() and the async stuff then
tries calling all of the nodes in this array again to see if any of
them now think that they've got tuples to return (or just to let them
do additional work without returning tuples).
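
A hypothetical shape for those two structures (PGPROC, procLatch and
SetLatch() are real; every other name here is invented for
illustration):

    typedef struct ParallelGroupSync
    {
        int         nmembers;   /* leader + workers registered so far */
        PGPROC     *procs[FLEXIBLE_ARRAY_MEMBER];   /* filled at startup */
    } ParallelGroupSync;

    /* plan nodes where one process may unblock others; per execParallel.c */
    typedef struct AsyncNodeList
    {
        int         nnodes;
        int         plan_node_ids[FLEXIBLE_ARRAY_MEMBER];
    } AsyncNodeList;

    void
    ParallelWorkerSetLatchesForGroup(ParallelGroupSync *sync)
    {
        int         i;

        /*
         * Wake every process in the group; each one then retries the
         * nodes in the AsyncNodeList to see if it can now make progress.
         */
        for (i = 0; i < sync->nmembers; i++)
            SetLatch(&sync->procs[i]->procLatch);
    }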

Also, the WaitEvent.user_data field can carry some more info besides the
plan state. It can have the parent PlanState stored, so that we don't
have to have a parent field in the plan state. It can also carry some
more data, such as "n-tuples-available".

I don't think that works, because execution may need to flow
arbitrarily far up the tree. Just knowing the immediate parent isn't
good enough. If it generates a tuple, then you have to in turn call
its parent, and if that one then produces a tuple, you have to continue
on even further up the tree. I think it's going to be very awkward to
make this work without those parent pointers.

BTW, we also need to benchmark those changes to add the parent
pointers and change the return conventions and see if they have any
measurable impact on performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#43Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Robert Haas (#42)
9 attachment(s)
Re: asynchronous and vectorized execution

Hello,

At Tue, 5 Jul 2016 11:39:41 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobnQ6ZpsubttBYC=pSLQ6d=0GuSgBsUFoaARMrie_75BA@mail.gmail.com>

On Wed, Jun 29, 2016 at 11:00 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

We may also want to consider handling abstract events such as
"tuples-are-available-at-plan-node-X".

One benefit is that we can combine this with batch processing. For
example, in the case of an Append node containing foreign scans, its
parent node may not want to process the Append node's result until Append
is ready with at least 1000 rows. In that case, the Append node needs to
raise an "n-tuples-are-ready" event; we cannot just rely on fd-ready
events. An fd-ready event will wake up the foreign scan, but it may not
eventually cause its parent Append node to in turn wake up its parent.

Right, I agree. I think this case only arises in parallel query. In
serial execution, there's not really any way for a plan node to just
become ready other than an FD or latch event or the parent becoming
ready. But in parallel query it can happen, of course, because some
other backend can do work that makes that node ready to produce
tuples.

It's not necessarily the case that we have to deal with this in the
initial patches for this feature, because the most obvious win for
this sort of thing is when we have an Append of ForeignScan plans.
Sure, parallel query has interesting cases, too, but a prototype that
just handles Append over a bunch of postgres_fdw ForeignScans would be
pretty cool. I suggest that we make that the initial goal here.

This seems to be a good opportunity to show this patch. The
attached patch set does asynchronous execution of ForeignScan
(postgres_fdw) on Robert's first infrastructure patch, with some
modifications.

ExecAsyncWaitForNode can get into infinite waiting through recursive
calls of ExecAsyncWaitForNode, caused by ExecProcNode being called from
async-unaware nodes. Such recursive calls cause a wait on
already-ready nodes.

I solved that in the patch set by allocating a separate
async-execution context for every async-execution subtree made by
ExecProcNode, instead of one async-exec context for the whole
execution tree. This works fine, but the way contexts are switched
seems ugly. This may also be solved by making ExecAsyncWaitForNode
return when there is no node to wait for, even if the waiting node is
not ready. That might keep the async-exec context (state) simpler, so
I'll try it.
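
A sketch of that alternative (names invented; this is not the attached
patch, though result_ready is the field from Robert's prototype): the
wait routine gives up instead of recursing when its context has nothing
left that can wake it:

    static bool
    ExecAsyncWaitForNode(AsyncExecState *aestate, PlanState *node)
    {
        while (!node->result_ready)
        {
            /* nothing registered can wake us; let the caller retry */
            if (aestate->nwaiting == 0)
                return false;

            /* wait for any registered FD/latch event and dispatch it */
            ExecAsyncDispatchOneEvent(aestate);
        }
        return true;
    }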

How we can do this event abstraction is the other question. We could have
one latch for each event, and each node would raise its own event by
setting the corresponding latch. But I am not sure about latches within a
single process, as opposed to one process waking up another process.
Otherwise, some internal event structures need to be present (in
estate?), which ExecProcNode would then use when it does the event
looping to wake up (i.e. execute) the required nodes.

I think adding more latches would be a bad idea. What I think we
should do instead is add two additional data structures to dynamic
shared memory:

1. An array of PGPROC * pointers for all of the workers. Processes
add their PGPROC * to this array as they start up. Then, parallel.h
can expose a new API, ParallelWorkerSetLatchesForGroup(void). In the
leader, this sets the latch for every worker process for every
parallel context with which the leader is associated; in a worker, it
sets the latch for the other processes in the parallel group, and for
the leader as well.

2. An array of executor nodes where one process might do something
that allows other processes to make progress on that node. This would
be set up somehow by execParallel.c, which would need to somehow
figure out which plan nodes want to be included in the list. When an
executor node does something that might unblock other workers, it
calls ParallelWorkerSetLatchesForGroup() and the async stuff then
tries calling all of the nodes in this array again to see if any of
them now think that they've got tuples to return (or just to let them
do additional work without returning tuples).

Does ParallelWorkerSetLatchesForGroup use a mutex or semaphore or
something like that, instead of latches?

Also, the WaitEvent.user_data field can carry some more info besides the
plan state. It can have the parent PlanState stored, so that we don't
have to have a parent field in the plan state. It can also carry some
more data, such as "n-tuples-available".

I don't think that works, because execution may need to flow
arbitrarily far up the tree. Just knowing the immediate parent isn't
good enough. If it generates a tuple, then you have to in turn call
its parent, and if that one then produces a tuple, you have to continue
on even further up the tree. I think it's going to be very awkward to
make this work without those parent pointers.

Basically agreed, but going up too far is bad, for the reason given
above.

BTW, we also need to benchmark those changes to add the parent
pointers and change the return conventions and see if they have any
measurable impact on performance.

I have to bring you bad news.

With the attached patch, an append over four foreign scans on one
server (local) performs about 10% faster, and an append over three or
four foreign scans on separate foreign servers (separate connections)
performs about twice as fast, but only when compiled with -O0. I
found that it benefits hopelessly little from compiler optimization,
while the unpatched version gets faster.

Anyway, the current state of this patch is attached.

For binaries compiled with both -O0 and -O2, I ran a simple query
"select sum(a) from <table>" on the tables generated by the attached
script: t0, pl, pf0 and pf1, which are a local table, an append over
local tables, an append over foreign tables on the same foreign
server, and an append over foreign tables on different foreign
servers, respectively. The numbers are the mean values of ten runs.

                 average(ms)   stddev
patched-O0
t0                 891.3934    18.74902154
pl                 416.3298    47.38902802
pf0              13523.0777    87.45769903
pf1               3376.6415   183.3578028

patched-O2
t0                 891.4309     5.245807775
pl                 408.2932     1.04260004
pf0              13640.3551    52.52211814
pf1               3470.1739   262.3522963

not-patched-O0
t0                 845.3927    18.98379876
pl                 363.4933     4.142091341
pf0              14986.1284    23.07288416
pf1              14961.0596   127.2587286

not-patched-O2
t0                 429.8462    31.51970532
pl                 176.3959     0.237832551
pf0               8129.3762    44.68774182
pf1               8211.6319    97.93206675

By the way, when running the attached testrun.sh, the result for the
first one or two runs of pf0 is about 30%-50% faster than the rest,
for some reason unknown to me...

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Modify-PlanState-to-include-a-pointer-to-the-parent-.patch (text/x-patch; charset=us-ascii)
From 19c42997100750febcf85879130a5f95e291257b Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 4 May 2016 12:19:03 -0400
Subject: [PATCH 1/7] Modify PlanState to include a pointer to the parent
 PlanState.

---
 src/backend/executor/execMain.c           | 22 ++++++++++++++--------
 src/backend/executor/execProcnode.c       |  5 ++++-
 src/backend/executor/nodeAgg.c            |  3 ++-
 src/backend/executor/nodeAppend.c         |  3 ++-
 src/backend/executor/nodeBitmapAnd.c      |  3 ++-
 src/backend/executor/nodeBitmapHeapscan.c |  3 ++-
 src/backend/executor/nodeBitmapOr.c       |  3 ++-
 src/backend/executor/nodeForeignscan.c    |  2 +-
 src/backend/executor/nodeGather.c         |  3 ++-
 src/backend/executor/nodeGroup.c          |  3 ++-
 src/backend/executor/nodeHash.c           |  3 ++-
 src/backend/executor/nodeHashjoin.c       |  6 ++++--
 src/backend/executor/nodeLimit.c          |  3 ++-
 src/backend/executor/nodeLockRows.c       |  3 ++-
 src/backend/executor/nodeMaterial.c       |  3 ++-
 src/backend/executor/nodeMergeAppend.c    |  3 ++-
 src/backend/executor/nodeMergejoin.c      |  4 +++-
 src/backend/executor/nodeModifyTable.c    |  3 ++-
 src/backend/executor/nodeNestloop.c       |  6 ++++--
 src/backend/executor/nodeRecursiveunion.c |  6 ++++--
 src/backend/executor/nodeResult.c         |  3 ++-
 src/backend/executor/nodeSetOp.c          |  3 ++-
 src/backend/executor/nodeSort.c           |  3 ++-
 src/backend/executor/nodeSubplan.c        |  1 +
 src/backend/executor/nodeSubqueryscan.c   |  3 ++-
 src/backend/executor/nodeUnique.c         |  3 ++-
 src/backend/executor/nodeWindowAgg.c      |  3 ++-
 src/include/executor/executor.h           |  3 ++-
 src/include/nodes/execnodes.h             |  2 ++
 29 files changed, 77 insertions(+), 37 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..ac6d62c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -923,7 +923,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
-	 * ExecInitSubPlan expects to be able to find these entries.
+	 * ExecInitSubPlan expects to be able to find these entries. Since the
+	 * main plan tree hasn't been initialized yet, we have to pass NULL as the
+	 * parent node to ExecInitNode; ExecInitSubPlan also takes responsibility
+	 * for fixing up subplanstate->parent.
 	 */
 	Assert(estate->es_subplanstates == NIL);
 	i = 1;						/* subplan indices count from 1 */
@@ -943,7 +946,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 		if (bms_is_member(i, plannedstmt->rewindPlanIDs))
 			sp_eflags |= EXEC_FLAG_REWIND;
 
-		subplanstate = ExecInitNode(subplan, estate, sp_eflags);
+		subplanstate = ExecInitNode(subplan, estate, NULL, sp_eflags);
 
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
@@ -954,9 +957,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/*
 	 * Initialize the private state information for all the nodes in the query
 	 * tree.  This opens files, allocates storage and leaves us ready to start
-	 * processing tuples.
+	 * processing tuples.  This is the root planstate node; it has no parent.
 	 */
-	planstate = ExecInitNode(plan, estate, eflags);
+	planstate = ExecInitNode(plan, estate, NULL, eflags);
 
 	/*
 	 * Get the tuple descriptor describing the type of tuples to return.
@@ -2849,7 +2852,9 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * ExecInitSubPlan expects to be able to find these entries. Some of the
 	 * SubPlans might not be used in the part of the plan tree we intend to
 	 * run, but since it's not easy to tell which, we just initialize them
-	 * all.
+	 * all.  Since the main plan tree hasn't been initialized yet, we have to
+	 * pass NULL as the parent node to ExecInitNode; ExecInitSubPlan also
+	 * takes responsibility for fixing up subplanstate->parent.
 	 */
 	Assert(estate->es_subplanstates == NIL);
 	foreach(l, parentestate->es_plannedstmt->subplans)
@@ -2857,7 +2862,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 		Plan	   *subplan = (Plan *) lfirst(l);
 		PlanState  *subplanstate;
 
-		subplanstate = ExecInitNode(subplan, estate, 0);
+		subplanstate = ExecInitNode(subplan, estate, NULL, 0);
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
 	}
@@ -2865,9 +2870,10 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	/*
 	 * Initialize the private state information for all the nodes in the part
 	 * of the plan tree we need to run.  This opens files, allocates storage
-	 * and leaves us ready to start processing tuples.
+	 * and leaves us ready to start processing tuples.  This is the root plan
+	 * node; it has no parent.
 	 */
-	epqstate->planstate = ExecInitNode(planTree, estate, 0);
+	epqstate->planstate = ExecInitNode(planTree, estate, NULL, 0);
 
 	MemoryContextSwitchTo(oldcontext);
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..680ca4b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -133,7 +133,7 @@
  * ------------------------------------------------------------------------
  */
 PlanState *
-ExecInitNode(Plan *node, EState *estate, int eflags)
+ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 {
 	PlanState  *result;
 	List	   *subps;
@@ -340,6 +340,9 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 			break;
 	}
 
+	/* Set parent pointer. */
+	result->parent = parent;
+
 	/*
 	 * Initialize any initPlans present in this node.  The planner put them in
 	 * a separate list for us.
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b3187e6..2c11acb 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2427,7 +2427,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	if (node->aggstrategy == AGG_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
 	outerPlan = outerPlan(node);
-	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(aggstate) =
+		ExecInitNode(outerPlan, estate, &aggstate->ss.ps, eflags);
 
 	/*
 	 * initialize source tuple type.
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..beb4ab8 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -165,7 +165,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		appendplanstates[i] = ExecInitNode(initNode, estate, &appendstate->ps,
+										   eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index c39d790..6405fa4 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -81,7 +81,8 @@ ExecInitBitmapAnd(BitmapAnd *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate,
+										   &bitmapandstate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..2ba5cd0 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -646,7 +646,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	 * relation's indexes, and we want to be sure we have acquired a lock on
 	 * the relation first.
 	 */
-	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate,
+											 &scanstate->ss.ps, eflags);
 
 	/*
 	 * all done.
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 7e928eb..faa3a37 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -82,7 +82,8 @@ ExecInitBitmapOr(BitmapOr *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate,
+										   &bitmaporstate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..7d9160d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -224,7 +224,7 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	/* Initialize any outer plan. */
 	if (outerPlan(node))
 		outerPlanState(scanstate) =
-			ExecInitNode(outerPlan(node), estate, eflags);
+			ExecInitNode(outerPlan(node), estate, &scanstate->ss.ps, eflags);
 
 	/*
 	 * Tell the FDW to initialize the scan.
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 313b234..6da52b3 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -97,7 +97,8 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	 * now initialize outer plan
 	 */
 	outerNode = outerPlan(node);
-	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
+	outerPlanState(gatherstate) =
+		ExecInitNode(outerNode, estate, &gatherstate->ps, eflags);
 
 	gatherstate->ps.ps_TupFromTlist = false;
 
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index dcf5175..3c066fc 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -233,7 +233,8 @@ ExecInitGroup(Group *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(grpstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(grpstate) =
+		ExecInitNode(outerPlan(node), estate, &grpstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 9ed09a7..5e78de0 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -200,7 +200,8 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(hashstate) =
+		ExecInitNode(outerPlan(node), estate, &hashstate->ps, eflags);
 
 	/*
 	 * initialize tuple type. no need to initialize projection info because
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 369e666..a7a908a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -486,8 +486,10 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	outerNode = outerPlan(node);
 	hashNode = (Hash *) innerPlan(node);
 
-	outerPlanState(hjstate) = ExecInitNode(outerNode, estate, eflags);
-	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
+	outerPlanState(hjstate) =
+		ExecInitNode(outerNode, estate, &hjstate->js.ps, eflags);
+	innerPlanState(hjstate) =
+		ExecInitNode((Plan *) hashNode, estate, &hjstate->js.ps, eflags);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index faf32e1..97267c5 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -412,7 +412,8 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	 * then initialize outer plan
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(limitstate) =
+		ExecInitNode(outerPlan, estate, &limitstate->ps, eflags);
 
 	/*
 	 * limit nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 4ebcaff..c4b5333 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -376,7 +376,8 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(lrstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(lrstate) =
+		ExecInitNode(outerPlan, estate, &lrstate->ps, eflags);
 
 	/*
 	 * LockRows nodes do no projections, so initialize projection info for
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9ab03f3..82e31c1 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -219,7 +219,8 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
 	outerPlan = outerPlan(node);
-	outerPlanState(matstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(matstate) =
+		ExecInitNode(outerPlan, estate, &matstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e271927..ae0e8dc 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -112,7 +112,8 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		mergeplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		mergeplanstates[i] =
+			ExecInitNode(initNode, estate, &mergestate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 6db09b8..cd8d6c6 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1522,8 +1522,10 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	 *
 	 * inner child must support MARK/RESTORE.
 	 */
-	outerPlanState(mergestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(mergestate) =
+		ExecInitNode(outerPlan(node), estate, &mergestate->js.ps, eflags);
 	innerPlanState(mergestate) = ExecInitNode(innerPlan(node), estate,
+											  &mergestate->js.ps,
 											  eflags | EXEC_FLAG_MARK);
 
 	/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index af7b26c..95cc2c6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1618,7 +1618,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
-		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
+		mtstate->mt_plans[i] =
+			ExecInitNode(subplan, estate, &mtstate->ps, eflags);
 
 		/* Also let FDWs init themselves for foreign-table result rels */
 		if (!resultRelInfo->ri_usesFdwDirectModify &&
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 555fa09..1895b60 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -340,12 +340,14 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 * inner child, because it will always be rescanned with fresh parameter
 	 * values.
 	 */
-	outerPlanState(nlstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(nlstate) =
+		ExecInitNode(outerPlan(node), estate, &nlstate->js.ps, eflags);
 	if (node->nestParams == NIL)
 		eflags |= EXEC_FLAG_REWIND;
 	else
 		eflags &= ~EXEC_FLAG_REWIND;
-	innerPlanState(nlstate) = ExecInitNode(innerPlan(node), estate, eflags);
+	innerPlanState(nlstate) =
+		ExecInitNode(innerPlan(node), estate, &nlstate->js.ps, eflags);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index e76405a..2328ef3 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -245,8 +245,10 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(rustate) = ExecInitNode(outerPlan(node), estate, eflags);
-	innerPlanState(rustate) = ExecInitNode(innerPlan(node), estate, eflags);
+	outerPlanState(rustate) =
+		ExecInitNode(outerPlan(node), estate, &rustate->ps, eflags);
+	innerPlanState(rustate) =
+		ExecInitNode(innerPlan(node), estate, &rustate->ps, eflags);
 
 	/*
 	 * If hashing, precompute fmgr lookup data for inner loop, and create the
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 4007b76..0d2de14 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -250,7 +250,8 @@ ExecInitResult(Result *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(resstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(resstate) =
+		ExecInitNode(outerPlan(node), estate, &resstate->ps, eflags);
 
 	/*
 	 * we don't use inner plan
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 2d81d46..7a3b67c 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -537,7 +537,8 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 	 */
 	if (node->strategy == SETOP_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
-	outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(setopstate) =
+		ExecInitNode(outerPlan(node), estate, &setopstate->ps, eflags);
 
 	/*
 	 * setop nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index a34dcc5..0286a7f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -199,7 +199,8 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	 */
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
-	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(sortstate) =
+		ExecInitNode(outerPlan(node), estate, &sortstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index e503494..458e254 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -707,6 +707,7 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent)
 
 	/* ... and to its parent's state */
 	sstate->parent = parent;
+	sstate->planstate->parent = parent;
 
 	/* Initialize subexpressions */
 	sstate->testexpr = ExecInitExpr((Expr *) subplan->testexpr, parent);
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 9bafc62..cb007a5 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -136,7 +136,8 @@ ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags)
 	/*
 	 * initialize subquery
 	 */
-	subquerystate->subplan = ExecInitNode(node->subplan, estate, eflags);
+	subquerystate->subplan =
+		ExecInitNode(node->subplan, estate, &subquerystate->ss.ps, eflags);
 
 	subquerystate->ss.ps.ps_TupFromTlist = false;
 
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 4caae34..5d13a89 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -145,7 +145,8 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(uniquestate) =
+		ExecInitNode(outerPlan(node), estate, &uniquestate->ps, eflags);
 
 	/*
 	 * unique nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index d4c88a1..bae713b 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1844,7 +1844,8 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	 * initialize child nodes
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(winstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(winstate) =
+		ExecInitNode(outerPlan, estate, &winstate->ss.ps, eflags);
 
 	/*
 	 * initialize source tuple type (which is also the tuple type that we'll
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 39521ed..28c0c2e 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -221,7 +221,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
 /*
  * prototypes from functions in execProcnode.c
  */
-extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
+extern PlanState *ExecInitNode(Plan *node, EState *estate, PlanState *parent,
+			 int eflags);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e7fd7bd..4b18436 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1030,6 +1030,8 @@ typedef struct PlanState
 								 * nodes point to one EState for the whole
 								 * top-level plan */
 
+	struct PlanState *parent;	/* node which will receive tuples from us */
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
-- 
1.8.3.1

0002-Modify-PlanState-to-have-result-result_ready-fields..patch (text/x-patch; charset=us-ascii)
From df0f3491e40164f8380c992c7f06a3361f486524 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 6 May 2016 13:01:48 -0400
Subject: [PATCH 2/7] Modify PlanState to have result/result_ready fields.
 Modify executor to use them instead of returning tuples directly.

---
 src/backend/executor/execProcnode.c       | 75 ++++++++++++++++++-------------
 src/backend/executor/execScan.c           | 26 +++++++----
 src/backend/executor/nodeAgg.c            | 13 +++---
 src/backend/executor/nodeAppend.c         | 11 +++--
 src/backend/executor/nodeBitmapHeapscan.c |  2 +-
 src/backend/executor/nodeCtescan.c        |  2 +-
 src/backend/executor/nodeCustom.c         |  4 +-
 src/backend/executor/nodeForeignscan.c    |  2 +-
 src/backend/executor/nodeFunctionscan.c   |  2 +-
 src/backend/executor/nodeGather.c         | 17 ++++---
 src/backend/executor/nodeGroup.c          | 24 +++++++---
 src/backend/executor/nodeHash.c           |  3 +-
 src/backend/executor/nodeHashjoin.c       | 29 ++++++++----
 src/backend/executor/nodeIndexonlyscan.c  |  2 +-
 src/backend/executor/nodeIndexscan.c      |  2 +-
 src/backend/executor/nodeLimit.c          | 42 ++++++++++++-----
 src/backend/executor/nodeLockRows.c       |  9 ++--
 src/backend/executor/nodeMaterial.c       | 21 ++++++---
 src/backend/executor/nodeMergeAppend.c    |  4 +-
 src/backend/executor/nodeMergejoin.c      | 74 ++++++++++++++++++++++--------
 src/backend/executor/nodeModifyTable.c    | 15 ++++---
 src/backend/executor/nodeNestloop.c       | 16 ++++---
 src/backend/executor/nodeRecursiveunion.c | 10 +++--
 src/backend/executor/nodeResult.c         | 20 ++++++---
 src/backend/executor/nodeSamplescan.c     |  2 +-
 src/backend/executor/nodeSeqscan.c        |  2 +-
 src/backend/executor/nodeSetOp.c          | 14 +++---
 src/backend/executor/nodeSort.c           |  4 +-
 src/backend/executor/nodeSubqueryscan.c   |  2 +-
 src/backend/executor/nodeTidscan.c        |  2 +-
 src/backend/executor/nodeUnique.c         |  8 ++--
 src/backend/executor/nodeValuesscan.c     |  2 +-
 src/backend/executor/nodeWindowAgg.c      | 17 ++++---
 src/backend/executor/nodeWorktablescan.c  |  2 +-
 src/include/executor/executor.h           | 11 ++++-
 src/include/executor/nodeAgg.h            |  2 +-
 src/include/executor/nodeAppend.h         |  2 +-
 src/include/executor/nodeBitmapHeapscan.h |  2 +-
 src/include/executor/nodeCtescan.h        |  2 +-
 src/include/executor/nodeCustom.h         |  2 +-
 src/include/executor/nodeForeignscan.h    |  2 +-
 src/include/executor/nodeFunctionscan.h   |  2 +-
 src/include/executor/nodeGather.h         |  2 +-
 src/include/executor/nodeGroup.h          |  2 +-
 src/include/executor/nodeHash.h           |  2 +-
 src/include/executor/nodeHashjoin.h       |  2 +-
 src/include/executor/nodeIndexonlyscan.h  |  2 +-
 src/include/executor/nodeIndexscan.h      |  2 +-
 src/include/executor/nodeLimit.h          |  2 +-
 src/include/executor/nodeLockRows.h       |  2 +-
 src/include/executor/nodeMaterial.h       |  2 +-
 src/include/executor/nodeMergeAppend.h    |  2 +-
 src/include/executor/nodeMergejoin.h      |  2 +-
 src/include/executor/nodeModifyTable.h    |  2 +-
 src/include/executor/nodeNestloop.h       |  2 +-
 src/include/executor/nodeRecursiveunion.h |  2 +-
 src/include/executor/nodeResult.h         |  2 +-
 src/include/executor/nodeSamplescan.h     |  2 +-
 src/include/executor/nodeSeqscan.h        |  2 +-
 src/include/executor/nodeSetOp.h          |  2 +-
 src/include/executor/nodeSort.h           |  2 +-
 src/include/executor/nodeSubqueryscan.h   |  2 +-
 src/include/executor/nodeTidscan.h        |  2 +-
 src/include/executor/nodeUnique.h         |  2 +-
 src/include/executor/nodeValuesscan.h     |  2 +-
 src/include/executor/nodeWindowAgg.h      |  2 +-
 src/include/executor/nodeWorktablescan.h  |  2 +-
 src/include/nodes/execnodes.h             |  2 +
 68 files changed, 360 insertions(+), 197 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 680ca4b..3f2ebff 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -380,6 +380,9 @@ ExecProcNode(PlanState *node)
 
 	CHECK_FOR_INTERRUPTS();
 
+	/* mark any previous result as having been consumed */
+	node->result_ready = false;
+
 	if (node->chgParam != NULL) /* something changed */
 		ExecReScan(node);		/* let ReScan handle this */
 
@@ -392,23 +395,23 @@ ExecProcNode(PlanState *node)
 			 * control nodes
 			 */
 		case T_ResultState:
-			result = ExecResult((ResultState *) node);
+			ExecResult((ResultState *) node);
 			break;
 
 		case T_ModifyTableState:
-			result = ExecModifyTable((ModifyTableState *) node);
+			ExecModifyTable((ModifyTableState *) node);
 			break;
 
 		case T_AppendState:
-			result = ExecAppend((AppendState *) node);
+			ExecAppend((AppendState *) node);
 			break;
 
 		case T_MergeAppendState:
-			result = ExecMergeAppend((MergeAppendState *) node);
+			ExecMergeAppend((MergeAppendState *) node);
 			break;
 
 		case T_RecursiveUnionState:
-			result = ExecRecursiveUnion((RecursiveUnionState *) node);
+			ExecRecursiveUnion((RecursiveUnionState *) node);
 			break;
 
 			/* BitmapAndState does not yield tuples */
@@ -419,119 +422,119 @@ ExecProcNode(PlanState *node)
 			 * scan nodes
 			 */
 		case T_SeqScanState:
-			result = ExecSeqScan((SeqScanState *) node);
+			ExecSeqScan((SeqScanState *) node);
 			break;
 
 		case T_SampleScanState:
-			result = ExecSampleScan((SampleScanState *) node);
+			ExecSampleScan((SampleScanState *) node);
 			break;
 
 		case T_IndexScanState:
-			result = ExecIndexScan((IndexScanState *) node);
+			ExecIndexScan((IndexScanState *) node);
 			break;
 
 		case T_IndexOnlyScanState:
-			result = ExecIndexOnlyScan((IndexOnlyScanState *) node);
+			ExecIndexOnlyScan((IndexOnlyScanState *) node);
 			break;
 
 			/* BitmapIndexScanState does not yield tuples */
 
 		case T_BitmapHeapScanState:
-			result = ExecBitmapHeapScan((BitmapHeapScanState *) node);
+			ExecBitmapHeapScan((BitmapHeapScanState *) node);
 			break;
 
 		case T_TidScanState:
-			result = ExecTidScan((TidScanState *) node);
+			ExecTidScan((TidScanState *) node);
 			break;
 
 		case T_SubqueryScanState:
-			result = ExecSubqueryScan((SubqueryScanState *) node);
+			ExecSubqueryScan((SubqueryScanState *) node);
 			break;
 
 		case T_FunctionScanState:
-			result = ExecFunctionScan((FunctionScanState *) node);
+			ExecFunctionScan((FunctionScanState *) node);
 			break;
 
 		case T_ValuesScanState:
-			result = ExecValuesScan((ValuesScanState *) node);
+			ExecValuesScan((ValuesScanState *) node);
 			break;
 
 		case T_CteScanState:
-			result = ExecCteScan((CteScanState *) node);
+			ExecCteScan((CteScanState *) node);
 			break;
 
 		case T_WorkTableScanState:
-			result = ExecWorkTableScan((WorkTableScanState *) node);
+			ExecWorkTableScan((WorkTableScanState *) node);
 			break;
 
 		case T_ForeignScanState:
-			result = ExecForeignScan((ForeignScanState *) node);
+			ExecForeignScan((ForeignScanState *) node);
 			break;
 
 		case T_CustomScanState:
-			result = ExecCustomScan((CustomScanState *) node);
+			ExecCustomScan((CustomScanState *) node);
 			break;
 
 			/*
 			 * join nodes
 			 */
 		case T_NestLoopState:
-			result = ExecNestLoop((NestLoopState *) node);
+			ExecNestLoop((NestLoopState *) node);
 			break;
 
 		case T_MergeJoinState:
-			result = ExecMergeJoin((MergeJoinState *) node);
+			ExecMergeJoin((MergeJoinState *) node);
 			break;
 
 		case T_HashJoinState:
-			result = ExecHashJoin((HashJoinState *) node);
+			ExecHashJoin((HashJoinState *) node);
 			break;
 
 			/*
 			 * materialization nodes
 			 */
 		case T_MaterialState:
-			result = ExecMaterial((MaterialState *) node);
+			ExecMaterial((MaterialState *) node);
 			break;
 
 		case T_SortState:
-			result = ExecSort((SortState *) node);
+			ExecSort((SortState *) node);
 			break;
 
 		case T_GroupState:
-			result = ExecGroup((GroupState *) node);
+			ExecGroup((GroupState *) node);
 			break;
 
 		case T_AggState:
-			result = ExecAgg((AggState *) node);
+			ExecAgg((AggState *) node);
 			break;
 
 		case T_WindowAggState:
-			result = ExecWindowAgg((WindowAggState *) node);
+			ExecWindowAgg((WindowAggState *) node);
 			break;
 
 		case T_UniqueState:
-			result = ExecUnique((UniqueState *) node);
+			ExecUnique((UniqueState *) node);
 			break;
 
 		case T_GatherState:
-			result = ExecGather((GatherState *) node);
+			ExecGather((GatherState *) node);
 			break;
 
 		case T_HashState:
-			result = ExecHash((HashState *) node);
+			ExecHash((HashState *) node);
 			break;
 
 		case T_SetOpState:
-			result = ExecSetOp((SetOpState *) node);
+			ExecSetOp((SetOpState *) node);
 			break;
 
 		case T_LockRowsState:
-			result = ExecLockRows((LockRowsState *) node);
+			ExecLockRows((LockRowsState *) node);
 			break;
 
 		case T_LimitState:
-			result = ExecLimit((LimitState *) node);
+			ExecLimit((LimitState *) node);
 			break;
 
 		default:
@@ -540,6 +543,14 @@ ExecProcNode(PlanState *node)
 			break;
 	}
 
+	/* We don't support asynchronous execution yet. */
+	Assert(node->result_ready);
+
+	/* Result should be a TupleTableSlot, unless it's NULL. */
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+
+	result = (TupleTableSlot *) node->result;
+
 	if (node->instrument)
 		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index fb0013d..095d40b 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -99,7 +99,7 @@ ExecScanFetch(ScanState *node,
  *		ExecScan
  *
  *		Scans the relation using the 'access method' indicated and
- *		returns the next qualifying tuple in the direction specified
+ *		produces the next qualifying tuple in the direction specified
  *		in the global variable ExecDirection.
  *		The access method returns the next tuple and ExecScan() is
  *		responsible for checking the tuple returned against the qual-clause.
@@ -117,7 +117,7 @@ ExecScanFetch(ScanState *node,
  *			 "cursor" is positioned before the first qualifying tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecScan(ScanState *node,
 		 ExecScanAccessMtd accessMtd,	/* function returning a tuple */
 		 ExecScanRecheckMtd recheckMtd)
@@ -137,12 +137,14 @@ ExecScan(ScanState *node,
 
 	/*
 	 * If we have neither a qual to check nor a projection to do, just skip
-	 * all the overhead and return the raw scan tuple.
+	 * all the overhead and produce the raw scan tuple.
 	 */
 	if (!qual && !projInfo)
 	{
 		ResetExprContext(econtext);
-		return ExecScanFetch(node, accessMtd, recheckMtd);
+		ExecReturnTuple(&node->ps,
+						ExecScanFetch(node, accessMtd, recheckMtd));
+		return;
 	}
 
 	/*
@@ -155,7 +157,10 @@ ExecScan(ScanState *node,
 		Assert(projInfo);		/* can't get here if not projecting */
 		resultSlot = ExecProject(projInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -188,9 +193,10 @@ ExecScan(ScanState *node,
 		if (TupIsNull(slot))
 		{
 			if (projInfo)
-				return ExecClearTuple(projInfo->pi_slot);
+				ExecReturnTuple(&node->ps, ExecClearTuple(projInfo->pi_slot));
 			else
-				return slot;
+				ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 
 		/*
@@ -221,7 +227,8 @@ ExecScan(ScanState *node,
 				if (isDone != ExprEndResult)
 				{
 					node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-					return resultSlot;
+					ExecReturnTuple(&node->ps, resultSlot);
+					return;
 				}
 			}
 			else
@@ -229,7 +236,8 @@ ExecScan(ScanState *node,
 				/*
 				 * Here, we aren't projecting, so just return scan tuple.
 				 */
-				return slot;
+				ExecReturnTuple(&node->ps, slot);
+				return;
 			}
 		}
 		else
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2c11acb..e690dbd 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1797,7 +1797,7 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
  *	  stored in the expression context to be used when ExecProject evaluates
  *	  the result tuple.
  */
-TupleTableSlot *
+void
 ExecAgg(AggState *node)
 {
 	TupleTableSlot *result;
@@ -1813,7 +1813,10 @@ ExecAgg(AggState *node)
 
 		result = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ss.ps.ps_TupFromTlist = false;
 	}
@@ -1823,6 +1826,7 @@ ExecAgg(AggState *node)
 	 * agg_done gets set before we emit the final aggregate tuple, and we have
 	 * to finish running SRFs for it.)
 	 */
+	result = NULL;
 	if (!node->agg_done)
 	{
 		/* Dispatch based on strategy */
@@ -1837,12 +1841,9 @@ ExecAgg(AggState *node)
 				result = agg_retrieve_direct(node);
 				break;
 		}
-
-		if (!TupIsNull(result))
-			return result;
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ss.ps, result);
 }
 
 /*
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index beb4ab8..e0ce8c6 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -191,7 +191,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
  *		Handles iteration over multiple subplans.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecAppend(AppendState *node)
 {
 	for (;;)
@@ -216,7 +216,8 @@ ExecAppend(AppendState *node)
 			 * NOT make use of the result slot that was set up in
 			 * ExecInitAppend; there's no need for it.
 			 */
-			return result;
+			ExecReturnTuple(&node->ps, result);
+			return;
 		}
 
 		/*
@@ -229,7 +230,11 @@ ExecAppend(AppendState *node)
 		else
 			node->as_whichplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			ExecReturnTuple(&node->ps,
+							ExecClearTuple(node->ps.ps_ResultTupleSlot));
+			return;
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2ba5cd0..31133ff 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -434,7 +434,7 @@ BitmapHeapRecheck(BitmapHeapScanState *node, TupleTableSlot *slot)
  *		ExecBitmapHeapScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecBitmapHeapScan(BitmapHeapScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 3c2f684..1f1fdf5 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -149,7 +149,7 @@ CteScanRecheck(CteScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecCteScan(CteScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeCustom.c b/src/backend/executor/nodeCustom.c
index 322abca..7162348 100644
--- a/src/backend/executor/nodeCustom.c
+++ b/src/backend/executor/nodeCustom.c
@@ -107,11 +107,11 @@ ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
 	return css;
 }
 
-TupleTableSlot *
+void
 ExecCustomScan(CustomScanState *node)
 {
 	Assert(node->methods->ExecCustomScan != NULL);
-	return node->methods->ExecCustomScan(node);
+	ExecReturnTuple(&node->ss.ps, node->methods->ExecCustomScan(node));
 }
 
 void
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 7d9160d..1f3e072 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -113,7 +113,7 @@ ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecForeignScan(ForeignScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index a03f6e7..3cccd8f 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -262,7 +262,7 @@ FunctionRecheck(FunctionScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecFunctionScan(FunctionScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 6da52b3..508ff75 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -126,7 +126,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
  *		the next qualifying tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecGather(GatherState *node)
 {
 	TupleTableSlot *fslot = node->funnel_slot;
@@ -207,7 +207,10 @@ ExecGather(GatherState *node)
 	{
 		resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -229,7 +232,10 @@ ExecGather(GatherState *node)
 		 */
 		slot = gather_getnext(node);
 		if (TupIsNull(slot))
-			return NULL;
+		{
+			ExecReturnTuple(&node->ps, NULL);
+			return;
+		}
 
 		/*
 		 * form the result tuple using ExecProject(), and return it --- unless
@@ -242,11 +248,12 @@ ExecGather(GatherState *node)
 		if (isDone != ExprEndResult)
 		{
 			node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-			return resultSlot;
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
 		}
 	}
 
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index 3c066fc..f33a316 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -31,7 +31,7 @@
  *
  *		Return one tuple for each group of matching input tuples.
  */
-TupleTableSlot *
+void
 ExecGroup(GroupState *node)
 {
 	ExprContext *econtext;
@@ -44,7 +44,10 @@ ExecGroup(GroupState *node)
 	 * get state info from node
 	 */
 	if (node->grp_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ss.ps, NULL);
+		return;
+	}
 	econtext = node->ss.ps.ps_ExprContext;
 	numCols = ((Group *) node->ss.ps.plan)->numCols;
 	grpColIdx = ((Group *) node->ss.ps.plan)->grpColIdx;
@@ -61,7 +64,10 @@ ExecGroup(GroupState *node)
 
 		result = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ss.ps.ps_TupFromTlist = false;
 	}
@@ -87,7 +93,8 @@ ExecGroup(GroupState *node)
 		{
 			/* empty input, so return nothing */
 			node->grp_done = TRUE;
-			return NULL;
+			ExecReturnTuple(&node->ss.ps, NULL);
+			return;
 		}
 		/* Copy tuple into firsttupleslot */
 		ExecCopySlot(firsttupleslot, outerslot);
@@ -115,7 +122,8 @@ ExecGroup(GroupState *node)
 			if (isDone != ExprEndResult)
 			{
 				node->ss.ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-				return result;
+				ExecReturnTuple(&node->ss.ps, result);
+				return;
 			}
 		}
 		else
@@ -139,7 +147,8 @@ ExecGroup(GroupState *node)
 			{
 				/* no more groups, so we're done */
 				node->grp_done = TRUE;
-				return NULL;
+				ExecReturnTuple(&node->ss.ps, NULL);
+				return;
 			}
 
 			/*
@@ -178,7 +187,8 @@ ExecGroup(GroupState *node)
 			if (isDone != ExprEndResult)
 			{
 				node->ss.ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-				return result;
+				ExecReturnTuple(&node->ss.ps, result);
+				return;
 			}
 		}
 		else
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 5e78de0..905eb30 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -56,11 +56,10 @@ static void *dense_alloc(HashJoinTable hashtable, Size size);
  *		stub for pro forma compliance
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecHash(HashState *node)
 {
 	elog(ERROR, "Hash node does not support ExecProcNode call convention");
-	return NULL;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index a7a908a..cc92fc3 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -58,7 +58,7 @@ static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
  *			  the other one is "outer".
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecHashJoin(HashJoinState *node)
 {
 	PlanState  *outerNode;
@@ -93,7 +93,10 @@ ExecHashJoin(HashJoinState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -155,7 +158,8 @@ ExecHashJoin(HashJoinState *node)
 					if (TupIsNull(node->hj_FirstOuterTupleSlot))
 					{
 						node->hj_OuterNotEmpty = false;
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 					}
 					else
 						node->hj_OuterNotEmpty = true;
@@ -183,7 +187,10 @@ ExecHashJoin(HashJoinState *node)
 				 * outer relation.
 				 */
 				if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
-					return NULL;
+				{
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
+				}
 
 				/*
 				 * need to remember whether nbatch has increased since we
@@ -323,7 +330,8 @@ ExecHashJoin(HashJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -362,7 +370,8 @@ ExecHashJoin(HashJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -401,7 +410,8 @@ ExecHashJoin(HashJoinState *node)
 					{
 						node->js.ps.ps_TupFromTlist =
 							(isDone == ExprMultipleResult);
-						return result;
+						ExecReturnTuple(&node->js.ps, result);
+						return;
 					}
 				}
 				else
@@ -414,7 +424,10 @@ ExecHashJoin(HashJoinState *node)
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
 				if (!ExecHashJoinNewBatch(node))
-					return NULL;	/* end of join */
+				{
+					ExecReturnTuple(&node->js.ps, NULL); /* end of join */
+					return;
+				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..47285a1 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -249,7 +249,7 @@ IndexOnlyRecheck(IndexOnlyScanState *node, TupleTableSlot *slot)
  *		ExecIndexOnlyScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecIndexOnlyScan(IndexOnlyScanState *node)
 {
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..6bf35d3 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -482,7 +482,7 @@ reorderqueue_pop(IndexScanState *node)
  *		ExecIndexScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecIndexScan(IndexScanState *node)
 {
 	/*
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index 97267c5..4e70183 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -36,7 +36,7 @@ static void pass_down_bound(LimitState *node, PlanState *child_node);
  *		filtering on the stream of tuples returned by a subplan.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecLimit(LimitState *node)
 {
 	ScanDirection direction;
@@ -72,7 +72,10 @@ ExecLimit(LimitState *node)
 			 * If backwards scan, just return NULL without changing state.
 			 */
 			if (!ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Check for empty window; if so, treat like empty subplan.
@@ -80,7 +83,8 @@ ExecLimit(LimitState *node)
 			if (node->count <= 0 && !node->noCount)
 			{
 				node->lstate = LIMIT_EMPTY;
-				return NULL;
+				ExecReturnTuple(&node->ps, NULL);
+				return;
 			}
 
 			/*
@@ -96,7 +100,8 @@ ExecLimit(LimitState *node)
 					 * any output at all.
 					 */
 					node->lstate = LIMIT_EMPTY;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 				node->subSlot = slot;
 				if (++node->position > node->offset)
@@ -115,7 +120,8 @@ ExecLimit(LimitState *node)
 			 * The subplan is known to return no tuples (or not more than
 			 * OFFSET tuples, in general).  So we return no tuples.
 			 */
-			return NULL;
+			ExecReturnTuple(&node->ps, NULL);
+			return;
 
 		case LIMIT_INWINDOW:
 			if (ScanDirectionIsForward(direction))
@@ -130,7 +136,8 @@ ExecLimit(LimitState *node)
 					node->position - node->offset >= node->count)
 				{
 					node->lstate = LIMIT_WINDOWEND;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 
 				/*
@@ -140,7 +147,8 @@ ExecLimit(LimitState *node)
 				if (TupIsNull(slot))
 				{
 					node->lstate = LIMIT_SUBPLANEOF;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 				node->subSlot = slot;
 				node->position++;
@@ -154,7 +162,8 @@ ExecLimit(LimitState *node)
 				if (node->position <= node->offset + 1)
 				{
 					node->lstate = LIMIT_WINDOWSTART;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 
 				/*
@@ -170,7 +179,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_SUBPLANEOF:
 			if (ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Backing up from subplan EOF, so re-fetch previous tuple; there
@@ -186,7 +198,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_WINDOWEND:
 			if (ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Backing up from window end: simply re-return the last tuple
@@ -199,7 +214,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_WINDOWSTART:
 			if (!ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Advancing after having backed off window start: simply
@@ -220,7 +238,7 @@ ExecLimit(LimitState *node)
 	/* Return the current tuple */
 	Assert(!TupIsNull(slot));
 
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /*
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index c4b5333..8daa203 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -35,7 +35,7 @@
  *		ExecLockRows
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecLockRows(LockRowsState *node)
 {
 	TupleTableSlot *slot;
@@ -57,7 +57,10 @@ lnext:
 	slot = ExecProcNode(outerPlan);
 
 	if (TupIsNull(slot))
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/* We don't need EvalPlanQual unless we get updated tuple version(s) */
 	epq_needed = false;
@@ -334,7 +337,7 @@ lnext:
 	}
 
 	/* Got all locks, so return the current tuple */
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 82e31c1..fd3b013 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -35,7 +35,7 @@
  *
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* result tuple from subplan */
+void
 ExecMaterial(MaterialState *node)
 {
 	EState	   *estate;
@@ -93,7 +93,11 @@ ExecMaterial(MaterialState *node)
 			 * fetch.
 			 */
 			if (!tuplestore_advance(tuplestorestate, forward))
-				return NULL;	/* the tuplestore must be empty */
+			{
+				/* the tuplestore must be empty */
+				ExecReturnTuple(&node->ss.ps, NULL);
+				return;
+			}
 		}
 		eof_tuplestore = false;
 	}
@@ -105,7 +109,10 @@ ExecMaterial(MaterialState *node)
 	if (!eof_tuplestore)
 	{
 		if (tuplestore_gettupleslot(tuplestorestate, forward, false, slot))
-			return slot;
+		{
+			ExecReturnTuple(&node->ss.ps, slot);
+			return;
+		}
 		if (forward)
 			eof_tuplestore = true;
 	}
@@ -132,7 +139,8 @@ ExecMaterial(MaterialState *node)
 		if (TupIsNull(outerslot))
 		{
 			node->eof_underlying = true;
-			return NULL;
+			ExecReturnTuple(&node->ss.ps, NULL);
+			return;
 		}
 
 		/*
@@ -146,13 +154,14 @@ ExecMaterial(MaterialState *node)
 		/*
 		 * We can just return the subplan's returned tuple, without copying.
 		 */
-		return outerslot;
+		ExecReturnTuple(&node->ss.ps, outerslot);
+		return;
 	}
 
 	/*
 	 * Nothing left ...
 	 */
-	return ExecClearTuple(slot);
+	ExecReturnTuple(&node->ss.ps, ExecClearTuple(slot));
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index ae0e8dc..3ef8120 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -164,7 +164,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
  *		Handles iteration over multiple subplans.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecMergeAppend(MergeAppendState *node)
 {
 	TupleTableSlot *result;
@@ -214,7 +214,7 @@ ExecMergeAppend(MergeAppendState *node)
 		result = node->ms_slots[i];
 	}
 
-	return result;
+	ExecReturnTuple(&node->ps, result);
 }
 
 /*
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index cd8d6c6..d73d9f4 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -615,7 +615,7 @@ ExecMergeTupleDump(MergeJoinState *mergestate)
  *		ExecMergeJoin
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecMergeJoin(MergeJoinState *node)
 {
 	List	   *joinqual;
@@ -653,7 +653,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -710,7 +713,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 							result = MJFillOuter(node);
 							if (result)
-								return result;
+							{
+								ExecReturnTuple(&node->js.ps, result);
+								return;
+							}
 						}
 						break;
 					case MJEVAL_ENDOFJOIN:
@@ -728,7 +734,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -765,7 +772,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 							result = MJFillInner(node);
 							if (result)
-								return result;
+							{
+								ExecReturnTuple(&node->js.ps, result);
+								return;
+							}
 						}
 						break;
 					case MJEVAL_ENDOFJOIN:
@@ -785,7 +795,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -868,7 +879,8 @@ ExecMergeJoin(MergeJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -901,7 +913,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1003,7 +1018,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1039,7 +1057,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1174,7 +1193,8 @@ ExecMergeJoin(MergeJoinState *node)
 								break;
 							}
 							/* Otherwise we're done. */
-							return NULL;
+							ExecReturnTuple(&node->js.ps, NULL);
+							return;
 					}
 				}
 				break;
@@ -1256,7 +1276,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1292,7 +1315,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1318,7 +1342,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/* Mark before advancing, if wanted */
@@ -1362,7 +1389,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1388,7 +1416,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/* Mark before advancing, if wanted */
@@ -1406,7 +1437,8 @@ ExecMergeJoin(MergeJoinState *node)
 				if (TupIsNull(innerTupleSlot))
 				{
 					MJ_printf("ExecMergeJoin: end of inner subplan\n");
-					return NULL;
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
 				}
 
 				/* Else remain in ENDOUTER state and process next tuple. */
@@ -1434,7 +1466,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1448,7 +1483,8 @@ ExecMergeJoin(MergeJoinState *node)
 				if (TupIsNull(outerTupleSlot))
 				{
 					MJ_printf("ExecMergeJoin: end of outer subplan\n");
-					return NULL;
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
 				}
 
 				/* Else remain in ENDINNER state and process next tuple. */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95cc2c6..0e05d4d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1298,7 +1298,7 @@ fireASTriggers(ModifyTableState *node)
  *		if needed.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecModifyTable(ModifyTableState *node)
 {
 	EState	   *estate = node->ps.state;
@@ -1333,7 +1333,10 @@ ExecModifyTable(ModifyTableState *node)
 	 * extra times.
 	 */
 	if (node->mt_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/*
 	 * On first call, fire BEFORE STATEMENT triggers before proceeding.
@@ -1411,7 +1414,8 @@ ExecModifyTable(ModifyTableState *node)
 			slot = ExecProcessReturning(resultRelInfo, NULL, planSlot);
 
 			estate->es_result_relation_info = saved_resultRelInfo;
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 
 		EvalPlanQualSetSlot(&node->mt_epqstate, planSlot);
@@ -1517,7 +1521,8 @@ ExecModifyTable(ModifyTableState *node)
 		if (slot)
 		{
 			estate->es_result_relation_info = saved_resultRelInfo;
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 	}
 
@@ -1531,7 +1536,7 @@ ExecModifyTable(ModifyTableState *node)
 
 	node->mt_done = true;
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 1895b60..54eff56 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -56,7 +56,7 @@
  *			   are prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecNestLoop(NestLoopState *node)
 {
 	NestLoop   *nl;
@@ -93,7 +93,10 @@ ExecNestLoop(NestLoopState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -128,7 +131,8 @@ ExecNestLoop(NestLoopState *node)
 			if (TupIsNull(outerTupleSlot))
 			{
 				ENL1_printf("no outer tuple, ending join");
-				return NULL;
+				ExecReturnTuple(&node->js.ps, NULL);
+				return;
 			}
 
 			ENL1_printf("saving new outer tuple information");
@@ -212,7 +216,8 @@ ExecNestLoop(NestLoopState *node)
 					{
 						node->js.ps.ps_TupFromTlist =
 							(isDone == ExprMultipleResult);
-						return result;
+						ExecReturnTuple(&node->js.ps, result);
+						return;
 					}
 				}
 				else
@@ -270,7 +275,8 @@ ExecNestLoop(NestLoopState *node)
 				{
 					node->js.ps.ps_TupFromTlist =
 						(isDone == ExprMultipleResult);
-					return result;
+					ExecReturnTuple(&node->js.ps, result);
+					return;
 				}
 			}
 			else
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 2328ef3..6e78eb2 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -72,7 +72,7 @@ build_hash_table(RecursiveUnionState *rustate)
  * 2.6 go back to 2.2
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecRecursiveUnion(RecursiveUnionState *node)
 {
 	PlanState  *outerPlan = outerPlanState(node);
@@ -102,7 +102,8 @@ ExecRecursiveUnion(RecursiveUnionState *node)
 			/* Each non-duplicate tuple goes to the working table ... */
 			tuplestore_puttupleslot(node->working_table, slot);
 			/* ... and to the caller */
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 		node->recursing = true;
 	}
@@ -151,10 +152,11 @@ ExecRecursiveUnion(RecursiveUnionState *node)
 		node->intermediate_empty = false;
 		tuplestore_puttupleslot(node->intermediate_table, slot);
 		/* ... and return it */
-		return slot;
+		ExecReturnTuple(&node->ps, slot);
+		return;
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 0d2de14..a830ffd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -63,7 +63,7 @@
  *		'nil' if the constant qualification is not satisfied.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecResult(ResultState *node)
 {
 	TupleTableSlot *outerTupleSlot;
@@ -87,7 +87,8 @@ ExecResult(ResultState *node)
 		if (!qualResult)
 		{
 			node->rs_done = true;
-			return NULL;
+			ExecReturnTuple(&node->ps, NULL);
+			return;
 		}
 	}
 
@@ -100,7 +101,10 @@ ExecResult(ResultState *node)
 	{
 		resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -130,7 +134,10 @@ ExecResult(ResultState *node)
 			outerTupleSlot = ExecProcNode(outerPlan);
 
 			if (TupIsNull(outerTupleSlot))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * prepare to compute projection expressions, which will expect to
@@ -157,11 +164,12 @@ ExecResult(ResultState *node)
 		if (isDone != ExprEndResult)
 		{
 			node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-			return resultSlot;
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
 		}
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 9ce7c02..89cce0e 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -95,7 +95,7 @@ SampleRecheck(SampleScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSampleScan(SampleScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 00bf3a5..0ca86d9 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -121,7 +121,7 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSeqScan(SeqScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 7a3b67c..b7a593f 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -191,7 +191,7 @@ set_output_count(SetOpState *setopstate, SetOpStatePerGroup pergroup)
  *		ExecSetOp
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecSetOp(SetOpState *node)
 {
 	SetOp	   *plannode = (SetOp *) node->ps.plan;
@@ -204,22 +204,26 @@ ExecSetOp(SetOpState *node)
 	if (node->numOutput > 0)
 	{
 		node->numOutput--;
-		return resultTupleSlot;
+		ExecReturnTuple(&node->ps, resultTupleSlot);
+		return;
 	}
 
 	/* Otherwise, we're done if we are out of groups */
 	if (node->setop_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/* Fetch the next tuple group according to the correct strategy */
 	if (plannode->strategy == SETOP_HASHED)
 	{
 		if (!node->table_filled)
 			setop_fill_hash_table(node);
-		return setop_retrieve_hash_table(node);
+		ExecReturnTuple(&node->ps, setop_retrieve_hash_table(node));
 	}
 	else
-		return setop_retrieve_direct(node);
+		ExecReturnTuple(&node->ps, setop_retrieve_direct(node));
 }
 
 /*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 0286a7f..13f721a 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -35,7 +35,7 @@
  *		  -- the outer child is prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSort(SortState *node)
 {
 	EState	   *estate;
@@ -138,7 +138,7 @@ ExecSort(SortState *node)
 	(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
 								  slot, NULL);
-	return slot;
+	ExecReturnTuple(&node->ss.ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index cb007a5..0562926 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -79,7 +79,7 @@ SubqueryRecheck(SubqueryScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSubqueryScan(SubqueryScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 2604103..e2a0479 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -387,7 +387,7 @@ TidRecheck(TidScanState *node, TupleTableSlot *slot)
  *		  -- tidPtr is -1.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecTidScan(TidScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 5d13a89..2daa001 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -42,7 +42,7 @@
  *		ExecUnique
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecUnique(UniqueState *node)
 {
 	Unique	   *plannode = (Unique *) node->ps.plan;
@@ -70,8 +70,8 @@ ExecUnique(UniqueState *node)
 		if (TupIsNull(slot))
 		{
 			/* end of subplan, so we're done */
-			ExecClearTuple(resultTupleSlot);
-			return NULL;
+			ExecReturnTuple(&node->ps, ExecClearTuple(resultTupleSlot));
+			return;
 		}
 
 		/*
@@ -98,7 +98,7 @@ ExecUnique(UniqueState *node)
 	 * won't guarantee that this source tuple is still accessible after
 	 * fetching the next source tuple.
 	 */
-	return ExecCopySlot(resultTupleSlot, slot);
+	ExecReturnTuple(&node->ps, ExecCopySlot(resultTupleSlot, slot));
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeValuesscan.c b/src/backend/executor/nodeValuesscan.c
index 9c03f8a..3e6c321 100644
--- a/src/backend/executor/nodeValuesscan.c
+++ b/src/backend/executor/nodeValuesscan.c
@@ -186,7 +186,7 @@ ValuesRecheck(ValuesScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecValuesScan(ValuesScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index bae713b..62fe48b 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1555,7 +1555,7 @@ update_frametailpos(WindowObject winobj, TupleTableSlot *slot)
  *	(ignoring the case of SRFs in the targetlist, that is).
  * -----------------
  */
-TupleTableSlot *
+void
 ExecWindowAgg(WindowAggState *winstate)
 {
 	TupleTableSlot *result;
@@ -1565,7 +1565,10 @@ ExecWindowAgg(WindowAggState *winstate)
 	int			numfuncs;
 
 	if (winstate->all_done)
-		return NULL;
+	{
+		ExecReturnTuple(&winstate->ss.ps, NULL);
+		return;
+	}
 
 	/*
 	 * Check to see if we're still projecting out tuples from a previous
@@ -1579,7 +1582,10 @@ ExecWindowAgg(WindowAggState *winstate)
 
 		result = ExecProject(winstate->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&winstate->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		winstate->ss.ps.ps_TupFromTlist = false;
 	}
@@ -1687,7 +1693,8 @@ restart:
 		else
 		{
 			winstate->all_done = true;
-			return NULL;
+			ExecReturnTuple(&winstate->ss.ps, NULL);
+			return;
 		}
 	}
 
@@ -1753,7 +1760,7 @@ restart:
 
 	winstate->ss.ps.ps_TupFromTlist =
 		(isDone == ExprMultipleResult);
-	return result;
+	ExecReturnTuple(&winstate->ss.ps, result);
 }
 
 /* -----------------
diff --git a/src/backend/executor/nodeWorktablescan.c b/src/backend/executor/nodeWorktablescan.c
index cfed6e6..c3615b2 100644
--- a/src/backend/executor/nodeWorktablescan.c
+++ b/src/backend/executor/nodeWorktablescan.c
@@ -77,7 +77,7 @@ WorkTableScanRecheck(WorkTableScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecWorkTableScan(WorkTableScanState *node)
 {
 	/*
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 28c0c2e..1eb09d8 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -228,6 +228,15 @@ extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
+/* Convenience function to set a node's result to a TupleTableSlot. */
+static inline void
+ExecReturnTuple(PlanState *node, TupleTableSlot *slot)
+{
+	Assert(!node->result_ready);
+	node->result = (Node *) slot;
+	node->result_ready = true;
+}
+
 /*
  * prototypes from functions in execQual.c
  */
@@ -256,7 +265,7 @@ extern TupleTableSlot *ExecProject(ProjectionInfo *projInfo,
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
 
-extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
+extern void ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 		 ExecScanRecheckMtd recheckMtd);
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, Index varno);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 54c75e8..b86ec6a 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAgg(AggState *node);
+extern void ExecAgg(AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..70a6b62 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern AppendState *ExecInitAppend(Append *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAppend(AppendState *node);
+extern void ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index 0ed9c78..069dbc7 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern BitmapHeapScanState *ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecBitmapHeapScan(BitmapHeapScanState *node);
+extern void ExecBitmapHeapScan(BitmapHeapScanState *node);
 extern void ExecEndBitmapHeapScan(BitmapHeapScanState *node);
 extern void ExecReScanBitmapHeapScan(BitmapHeapScanState *node);
 
diff --git a/src/include/executor/nodeCtescan.h b/src/include/executor/nodeCtescan.h
index ef5c2bc..8411fa1 100644
--- a/src/include/executor/nodeCtescan.h
+++ b/src/include/executor/nodeCtescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern CteScanState *ExecInitCteScan(CteScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecCteScan(CteScanState *node);
+extern void ExecCteScan(CteScanState *node);
 extern void ExecEndCteScan(CteScanState *node);
 extern void ExecReScanCteScan(CteScanState *node);
 
diff --git a/src/include/executor/nodeCustom.h b/src/include/executor/nodeCustom.h
index 7d16c2b..5df2ebb 100644
--- a/src/include/executor/nodeCustom.h
+++ b/src/include/executor/nodeCustom.h
@@ -21,7 +21,7 @@
  */
 extern CustomScanState *ExecInitCustomScan(CustomScan *custom_scan,
 				   EState *estate, int eflags);
-extern TupleTableSlot *ExecCustomScan(CustomScanState *node);
+extern void ExecCustomScan(CustomScanState *node);
 extern void ExecEndCustomScan(CustomScanState *node);
 
 extern void ExecReScanCustomScan(CustomScanState *node);
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3d0f7bd 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -18,7 +18,7 @@
 #include "nodes/execnodes.h"
 
 extern ForeignScanState *ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecForeignScan(ForeignScanState *node);
+extern void ExecForeignScan(ForeignScanState *node);
 extern void ExecEndForeignScan(ForeignScanState *node);
 extern void ExecReScanForeignScan(ForeignScanState *node);
 
diff --git a/src/include/executor/nodeFunctionscan.h b/src/include/executor/nodeFunctionscan.h
index d6e7a61..15beb13 100644
--- a/src/include/executor/nodeFunctionscan.h
+++ b/src/include/executor/nodeFunctionscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern FunctionScanState *ExecInitFunctionScan(FunctionScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecFunctionScan(FunctionScanState *node);
+extern void ExecFunctionScan(FunctionScanState *node);
 extern void ExecEndFunctionScan(FunctionScanState *node);
 extern void ExecReScanFunctionScan(FunctionScanState *node);
 
diff --git a/src/include/executor/nodeGather.h b/src/include/executor/nodeGather.h
index f76d9be..100a827 100644
--- a/src/include/executor/nodeGather.h
+++ b/src/include/executor/nodeGather.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern GatherState *ExecInitGather(Gather *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecGather(GatherState *node);
+extern void ExecGather(GatherState *node);
 extern void ExecEndGather(GatherState *node);
 extern void ExecShutdownGather(GatherState *node);
 extern void ExecReScanGather(GatherState *node);
diff --git a/src/include/executor/nodeGroup.h b/src/include/executor/nodeGroup.h
index 92639f5..446ded5 100644
--- a/src/include/executor/nodeGroup.h
+++ b/src/include/executor/nodeGroup.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern GroupState *ExecInitGroup(Group *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecGroup(GroupState *node);
+extern void ExecGroup(GroupState *node);
 extern void ExecEndGroup(GroupState *node);
 extern void ExecReScanGroup(GroupState *node);
 
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 8cf6d15..b395fd9 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern HashState *ExecInitHash(Hash *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHash(HashState *node);
+extern void ExecHash(HashState *node);
 extern Node *MultiExecHash(HashState *node);
 extern void ExecEndHash(HashState *node);
 extern void ExecReScanHash(HashState *node);
diff --git a/src/include/executor/nodeHashjoin.h b/src/include/executor/nodeHashjoin.h
index f24127a..072c610 100644
--- a/src/include/executor/nodeHashjoin.h
+++ b/src/include/executor/nodeHashjoin.h
@@ -18,7 +18,7 @@
 #include "storage/buffile.h"
 
 extern HashJoinState *ExecInitHashJoin(HashJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHashJoin(HashJoinState *node);
+extern void ExecHashJoin(HashJoinState *node);
 extern void ExecEndHashJoin(HashJoinState *node);
 extern void ExecReScanHashJoin(HashJoinState *node);
 
diff --git a/src/include/executor/nodeIndexonlyscan.h b/src/include/executor/nodeIndexonlyscan.h
index d63d194..0fbcf80 100644
--- a/src/include/executor/nodeIndexonlyscan.h
+++ b/src/include/executor/nodeIndexonlyscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern IndexOnlyScanState *ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecIndexOnlyScan(IndexOnlyScanState *node);
+extern void ExecIndexOnlyScan(IndexOnlyScanState *node);
 extern void ExecEndIndexOnlyScan(IndexOnlyScanState *node);
 extern void ExecIndexOnlyMarkPos(IndexOnlyScanState *node);
 extern void ExecIndexOnlyRestrPos(IndexOnlyScanState *node);
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..341dab3 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecIndexScan(IndexScanState *node);
+extern void ExecIndexScan(IndexScanState *node);
 extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
diff --git a/src/include/executor/nodeLimit.h b/src/include/executor/nodeLimit.h
index 96166b4..03dde30 100644
--- a/src/include/executor/nodeLimit.h
+++ b/src/include/executor/nodeLimit.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern LimitState *ExecInitLimit(Limit *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLimit(LimitState *node);
+extern void ExecLimit(LimitState *node);
 extern void ExecEndLimit(LimitState *node);
 extern void ExecReScanLimit(LimitState *node);
 
diff --git a/src/include/executor/nodeLockRows.h b/src/include/executor/nodeLockRows.h
index e828e9c..eda3cbec 100644
--- a/src/include/executor/nodeLockRows.h
+++ b/src/include/executor/nodeLockRows.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern LockRowsState *ExecInitLockRows(LockRows *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLockRows(LockRowsState *node);
+extern void ExecLockRows(LockRowsState *node);
 extern void ExecEndLockRows(LockRowsState *node);
 extern void ExecReScanLockRows(LockRowsState *node);
 
diff --git a/src/include/executor/nodeMaterial.h b/src/include/executor/nodeMaterial.h
index 2b8cae1..20bc7f6 100644
--- a/src/include/executor/nodeMaterial.h
+++ b/src/include/executor/nodeMaterial.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MaterialState *ExecInitMaterial(Material *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMaterial(MaterialState *node);
+extern void ExecMaterial(MaterialState *node);
 extern void ExecEndMaterial(MaterialState *node);
 extern void ExecMaterialMarkPos(MaterialState *node);
 extern void ExecMaterialRestrPos(MaterialState *node);
diff --git a/src/include/executor/nodeMergeAppend.h b/src/include/executor/nodeMergeAppend.h
index 0efc489..e43b5e6 100644
--- a/src/include/executor/nodeMergeAppend.h
+++ b/src/include/executor/nodeMergeAppend.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MergeAppendState *ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMergeAppend(MergeAppendState *node);
+extern void ExecMergeAppend(MergeAppendState *node);
 extern void ExecEndMergeAppend(MergeAppendState *node);
 extern void ExecReScanMergeAppend(MergeAppendState *node);
 
diff --git a/src/include/executor/nodeMergejoin.h b/src/include/executor/nodeMergejoin.h
index 74d691c..dfdbc1b 100644
--- a/src/include/executor/nodeMergejoin.h
+++ b/src/include/executor/nodeMergejoin.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MergeJoinState *ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMergeJoin(MergeJoinState *node);
+extern void ExecMergeJoin(MergeJoinState *node);
 extern void ExecEndMergeJoin(MergeJoinState *node);
 extern void ExecReScanMergeJoin(MergeJoinState *node);
 
diff --git a/src/include/executor/nodeModifyTable.h b/src/include/executor/nodeModifyTable.h
index 6b66353..fe67248 100644
--- a/src/include/executor/nodeModifyTable.h
+++ b/src/include/executor/nodeModifyTable.h
@@ -16,7 +16,7 @@
 #include "nodes/execnodes.h"
 
 extern ModifyTableState *ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecModifyTable(ModifyTableState *node);
+extern void ExecModifyTable(ModifyTableState *node);
 extern void ExecEndModifyTable(ModifyTableState *node);
 extern void ExecReScanModifyTable(ModifyTableState *node);
 
diff --git a/src/include/executor/nodeNestloop.h b/src/include/executor/nodeNestloop.h
index eeb42d6..cab1885 100644
--- a/src/include/executor/nodeNestloop.h
+++ b/src/include/executor/nodeNestloop.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern NestLoopState *ExecInitNestLoop(NestLoop *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecNestLoop(NestLoopState *node);
+extern void ExecNestLoop(NestLoopState *node);
 extern void ExecEndNestLoop(NestLoopState *node);
 extern void ExecReScanNestLoop(NestLoopState *node);
 
diff --git a/src/include/executor/nodeRecursiveunion.h b/src/include/executor/nodeRecursiveunion.h
index 1c08790..fb11eca 100644
--- a/src/include/executor/nodeRecursiveunion.h
+++ b/src/include/executor/nodeRecursiveunion.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern RecursiveUnionState *ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecRecursiveUnion(RecursiveUnionState *node);
+extern void ExecRecursiveUnion(RecursiveUnionState *node);
 extern void ExecEndRecursiveUnion(RecursiveUnionState *node);
 extern void ExecReScanRecursiveUnion(RecursiveUnionState *node);
 
diff --git a/src/include/executor/nodeResult.h b/src/include/executor/nodeResult.h
index 356027f..951fae6 100644
--- a/src/include/executor/nodeResult.h
+++ b/src/include/executor/nodeResult.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern ResultState *ExecInitResult(Result *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecResult(ResultState *node);
+extern void ExecResult(ResultState *node);
 extern void ExecEndResult(ResultState *node);
 extern void ExecResultMarkPos(ResultState *node);
 extern void ExecResultRestrPos(ResultState *node);
diff --git a/src/include/executor/nodeSamplescan.h b/src/include/executor/nodeSamplescan.h
index c8f03d8..4ab6e5a 100644
--- a/src/include/executor/nodeSamplescan.h
+++ b/src/include/executor/nodeSamplescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SampleScanState *ExecInitSampleScan(SampleScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSampleScan(SampleScanState *node);
+extern void ExecSampleScan(SampleScanState *node);
 extern void ExecEndSampleScan(SampleScanState *node);
 extern void ExecReScanSampleScan(SampleScanState *node);
 
diff --git a/src/include/executor/nodeSeqscan.h b/src/include/executor/nodeSeqscan.h
index f2e61ff..816d1a5 100644
--- a/src/include/executor/nodeSeqscan.h
+++ b/src/include/executor/nodeSeqscan.h
@@ -18,7 +18,7 @@
 #include "nodes/execnodes.h"
 
 extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSeqScan(SeqScanState *node);
+extern void ExecSeqScan(SeqScanState *node);
 extern void ExecEndSeqScan(SeqScanState *node);
 extern void ExecReScanSeqScan(SeqScanState *node);
 
diff --git a/src/include/executor/nodeSetOp.h b/src/include/executor/nodeSetOp.h
index c6e9603..dd88afb 100644
--- a/src/include/executor/nodeSetOp.h
+++ b/src/include/executor/nodeSetOp.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SetOpState *ExecInitSetOp(SetOp *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSetOp(SetOpState *node);
+extern void ExecSetOp(SetOpState *node);
 extern void ExecEndSetOp(SetOpState *node);
 extern void ExecReScanSetOp(SetOpState *node);
 
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index 481065f..f65037d 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSort(SortState *node);
+extern void ExecSort(SortState *node);
 extern void ExecEndSort(SortState *node);
 extern void ExecSortMarkPos(SortState *node);
 extern void ExecSortRestrPos(SortState *node);
diff --git a/src/include/executor/nodeSubqueryscan.h b/src/include/executor/nodeSubqueryscan.h
index 427699b..a3962c7 100644
--- a/src/include/executor/nodeSubqueryscan.h
+++ b/src/include/executor/nodeSubqueryscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SubqueryScanState *ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSubqueryScan(SubqueryScanState *node);
+extern void ExecSubqueryScan(SubqueryScanState *node);
 extern void ExecEndSubqueryScan(SubqueryScanState *node);
 extern void ExecReScanSubqueryScan(SubqueryScanState *node);
 
diff --git a/src/include/executor/nodeTidscan.h b/src/include/executor/nodeTidscan.h
index 76c2a9f..5b7bbfd 100644
--- a/src/include/executor/nodeTidscan.h
+++ b/src/include/executor/nodeTidscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern TidScanState *ExecInitTidScan(TidScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecTidScan(TidScanState *node);
+extern void ExecTidScan(TidScanState *node);
 extern void ExecEndTidScan(TidScanState *node);
 extern void ExecReScanTidScan(TidScanState *node);
 
diff --git a/src/include/executor/nodeUnique.h b/src/include/executor/nodeUnique.h
index aa8491d..b53a553 100644
--- a/src/include/executor/nodeUnique.h
+++ b/src/include/executor/nodeUnique.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern UniqueState *ExecInitUnique(Unique *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecUnique(UniqueState *node);
+extern void ExecUnique(UniqueState *node);
 extern void ExecEndUnique(UniqueState *node);
 extern void ExecReScanUnique(UniqueState *node);
 
diff --git a/src/include/executor/nodeValuesscan.h b/src/include/executor/nodeValuesscan.h
index 026f261..90288fc 100644
--- a/src/include/executor/nodeValuesscan.h
+++ b/src/include/executor/nodeValuesscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern ValuesScanState *ExecInitValuesScan(ValuesScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecValuesScan(ValuesScanState *node);
+extern void ExecValuesScan(ValuesScanState *node);
 extern void ExecEndValuesScan(ValuesScanState *node);
 extern void ExecReScanValuesScan(ValuesScanState *node);
 
diff --git a/src/include/executor/nodeWindowAgg.h b/src/include/executor/nodeWindowAgg.h
index 94ed037..f5e2c98 100644
--- a/src/include/executor/nodeWindowAgg.h
+++ b/src/include/executor/nodeWindowAgg.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern WindowAggState *ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecWindowAgg(WindowAggState *node);
+extern void ExecWindowAgg(WindowAggState *node);
 extern void ExecEndWindowAgg(WindowAggState *node);
 extern void ExecReScanWindowAgg(WindowAggState *node);
 
diff --git a/src/include/executor/nodeWorktablescan.h b/src/include/executor/nodeWorktablescan.h
index 217208a..7b1eecb 100644
--- a/src/include/executor/nodeWorktablescan.h
+++ b/src/include/executor/nodeWorktablescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern WorkTableScanState *ExecInitWorkTableScan(WorkTableScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecWorkTableScan(WorkTableScanState *node);
+extern void ExecWorkTableScan(WorkTableScanState *node);
 extern void ExecEndWorkTableScan(WorkTableScanState *node);
 extern void ExecReScanWorkTableScan(WorkTableScanState *node);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4b18436..ff6c453 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1031,6 +1031,8 @@ typedef struct PlanState
 								 * top-level plan */
 
 	struct PlanState *parent;	/* node which will receive tuples from us */
+	bool		result_ready;	/* true if result is ready */
+	Node	   *result;			/* result, most often TupleTableSlot */
 
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
-- 
1.8.3.1
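
To make the new convention concrete, here is a rough sketch (not part of the
patch; FooState and FooGetNextSlot are invented names) of what a node-level
Exec routine looks like once it returns void: instead of handing a slot back
to the caller, it publishes its result through the PlanState fields added
above, using the ExecReturnTuple helper.

void
ExecFoo(FooState *node)
{
	TupleTableSlot *slot;

	/* Compute the next tuple; an empty slot means no more rows. */
	slot = FooGetNextSlot(node);

	/* Publish it: sets node->ps.result and node->ps.result_ready. */
	ExecReturnTuple(&node->ps, slot);
}

ExecProcNode then pulls the slot back out of node->result, so purely
synchronous nodes keep their old behavior.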

0003-Lightweight-framework-for-waiting-for-events.patch (text/x-patch; charset=us-ascii)
From ca98941c513a62dd98bb9321d3a333804e9c4217 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 9 May 2016 11:48:11 -0400
Subject: [PATCH 3/7] Lightweight framework for waiting for events.

---
 src/backend/executor/Makefile       |   4 +-
 src/backend/executor/execAsync.c    | 256 ++++++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c |  82 ++++++++----
 src/include/executor/execAsync.h    |  23 ++++
 src/include/executor/executor.h     |   2 +
 src/include/nodes/execnodes.h       |  10 ++
 6 files changed, 352 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..20601fa
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,256 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * This file contains routines that are intended to support asynchronous
+ * execution; that is, suspending an executor node until some external
+ * event occurs, or until one of its child nodes produces a tuple.
+ * This allows the executor to avoid blocking on a single external event,
+ * such as a file descriptor waiting on I/O, or a parallel worker which
+ * must complete work elsewhere in the plan tree, when there might at the
+ * same time be useful computation that could be accomplished in some
+ * other part of the plan tree.
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/executor.h"
+#include "storage/latch.h"
+
+#define	EVENT_BUFFER_SIZE		16
+
+static void ExecAsyncConfigureWait(PlanState *planstate, bool reinit);
+
+void
+ExecAsyncWaitForNode(PlanState *planstate)
+{
+	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
+	PlanState  *callbacks[EVENT_BUFFER_SIZE];
+	int			ncallbacks = 0;
+	EState *estate = planstate->state;
+
+	while (!planstate->result_ready)
+	{
+		bool	reinit = (estate->es_wait_event_set == NULL);
+		int		n;
+		int		noccurred;
+
+		if (reinit)
+		{
+			/*
+			 * Allow for a few extra events without reinitializing.  It
+			 * doesn't seem worth the complexity of doing anything very
+			 * aggressive here, because plans that depend on massive numbers
+			 * of external FDs are likely to run afoul of kernel limits anyway.
+			 */
+			estate->es_max_async_events = estate->es_total_async_events + 16;
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_max_async_events);
+		}
+
+		/* Give each waiting node a chance to add or modify events. */
+		for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+			ExecAsyncConfigureWait(estate->es_waiting_nodes[n], reinit);
+
+		/* Wait for at least one event to occur. */
+		noccurred = WaitEventSetWait(estate->es_wait_event_set, -1,
+									 occurred_event, EVENT_BUFFER_SIZE);
+		Assert(noccurred > 0);
+
+		/*
+		 * Loop over the occurred events and make a list of nodes that need
+		 * a callback.  The waiting nodes should have registered their wait
+		 * events with user_data pointing back to the node.
+		 */
+		for (n = 0; n < noccurred; ++n)
+		{
+			WaitEvent  *w = &occurred_event[n];
+			PlanState  *ps = w->user_data;
+
+			callbacks[ncallbacks++] = ps;
+		}
+
+		/*
+		 * Initially, this loop will call the node-type-specific function for
+		 * each node for which an event occurred.  If any of those nodes
+		 * produce a result, its parent enters the set of nodes that are
+		 * pending for a callback.  In this way, when a result becomes
+		 * available in a leaf of the plan tree, it can bubble upwards towards
+		 * the root as far as necessary.
+		 */
+		while (ncallbacks > 0)
+		{
+			int		i,
+					j;
+
+			/* Loop over all callbacks. */
+			for (i = 0; i < ncallbacks; ++i)
+			{
+				/* Skip if NULL. */
+				if (callbacks[i] == NULL)
+					continue;
+
+				/*
+				 * Remove any duplicates.  O(n) may not seem good, but it
+				 * should hopefully be OK as long as EVENT_BUFFER_SIZE is
+				 * not too large.
+				 */
+				for (j = i + 1; j < ncallbacks; ++j)
+					if (callbacks[i] == callbacks[j])
+						callbacks[j] = NULL;
+
+				/* Dispatch to node-type-specific code. */
+				ExecDispatchNode(callbacks[i]);
+
+				/*
+				 * If there's now a tuple ready, we must dispatch to the
+				 * parent node; otherwise, there's nothing more to do.
+				 */
+				if (callbacks[i]->result_ready)
+					callbacks[i] = callbacks[i]->parent;
+				else
+					callbacks[i] = NULL;
+			}
+
+			/* Squeeze out NULLs. */
+			for (i = 0, j = 0; j < ncallbacks; ++j)
+				if (callbacks[j] != NULL)
+					callbacks[i++] = callbacks[j];
+			ncallbacks = i;
+		}
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more events that can be registered on a WaitEventSet.  nevents
+ * should be the maximum number of events that it will wish to register.
+ * reinit should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncNeedsWait(PlanState *planstate, int nevents, bool reinit)
+{
+	EState *estate = planstate->state;
+
+	Assert(nevents > 0); 	/* otherwise, use ExecAsyncDoesNotNeedWait */
+
+	/*
+	 * If this node is not already present in the array of waiting nodes,
+	 * then add it.  If that array hasn't been allocated or is full, this may
+	 * require (re)allocating it.
+	 */
+	if (planstate->n_async_events == 0)
+	{
+		if (estate->es_max_waiting_nodes >= estate->es_num_waiting_nodes)
+		{
+			int		newmax;
+
+			if (estate->es_max_waiting_nodes == 0)
+			{
+				newmax = 16;
+				estate->es_waiting_nodes =
+					MemoryContextAlloc(estate->es_query_cxt, newmax);
+			}
+			else
+			{
+				newmax = estate->es_max_waiting_nodes * 2;
+				estate->es_waiting_nodes =
+					repalloc(estate->es_waiting_nodes,
+							 newmax * sizeof(PlanState *));
+			}
+			estate->es_max_waiting_nodes = newmax;
+		}
+		estate->es_waiting_nodes[estate->es_num_waiting_nodes++] = planstate;
+	}
+
+	/* Adjust per-node and per-estate totals. */
+	estate->es_total_async_events -= planstate->n_async_events;
+	planstate->n_async_events = nevents;
+	estate->es_total_async_events += planstate->n_async_events;
+
+	/*
+	 * If a WaitEventSet has already been created, we need to discard it and
+	 * start again if the user passed reinit = true, or if the total number of
+	 * required events exceeds the supported number.
+	 */
+	if (estate->es_wait_event_set != NULL && (reinit ||
+		estate->es_total_async_events > estate->es_max_async_events))
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * If an executor node no longer needs to wait, it should call this function
+ * to report that fact.
+ */
+void
+ExecAsyncDoesNotNeedWait(PlanState *planstate)
+{
+	int		n;
+	EState *estate = planstate->state;
+
+	if (planstate->n_async_events <= 0)
+		return;
+
+	/*
+	 * Remove the node from the list of waiting nodes.  (Is a linear search
+	 * going to be a problem here?  I think probably not.)
+	 */
+	for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+	{
+		if (estate->es_waiting_nodes[n] == planstate)
+		{
+			estate->es_waiting_nodes[n] =
+				estate->es_waiting_nodes[--estate->es_num_waiting_nodes];
+			break;
+		}
+	}
+
+	/* We should always find ourselves in the array. */
+	Assert(n < estate->es_num_waiting_nodes);
+
+	/* We no longer need any asynchronous events. */
+	estate->es_total_async_events -= planstate->n_async_events;
+	planstate->n_async_events = 0;
+
+	/*
+	 * The next wait will need to rebuild the WaitEventSet, because whatever
+	 * events we registered are gone now.  It's probably OK that this code
+	 * assumes we actually did register some events at one point, because we
+	 * needed to wait at some point and we don't any more.
+	 */
+	if (estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * Give per-nodetype function a chance to register wait events.
+ */
+static void
+ExecAsyncConfigureWait(PlanState *planstate, bool reinit)
+{
+	switch (nodeTag(planstate))
+	{
+		/* XXX: Add calls to per-nodetype handlers here. */
+		default:
+			elog(ERROR, "unexpected node type: %d", nodeTag(planstate));
+	}
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 3f2ebff..b7ac08e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -77,6 +77,7 @@
  */
 #include "postgres.h"
 
+#include "executor/execAsync.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
 #include "executor/nodeAppend.h"
@@ -368,24 +369,14 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 
 
 /* ----------------------------------------------------------------
- *		ExecProcNode
+ *		ExecDispatchNode
  *
- *		Execute the given node to return a(nother) tuple.
+ *		Invoke the given node's dispatch function.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
-ExecProcNode(PlanState *node)
+void
+ExecDispatchNode(PlanState *node)
 {
-	TupleTableSlot *result;
-
-	CHECK_FOR_INTERRUPTS();
-
-	/* mark any previous result as having been consumed */
-	node->result_ready = false;
-
-	if (node->chgParam != NULL) /* something changed */
-		ExecReScan(node);		/* let ReScan handle this */
-
 	if (node->instrument)
 		InstrStartNode(node->instrument);
 
@@ -539,22 +530,67 @@ ExecProcNode(PlanState *node)
 
 		default:
 			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			result = NULL;
 			break;
 	}
 
-	/* We don't support asynchronous execution yet. */
-	Assert(node->result_ready);
+	if (node->instrument)
+	{
+		double	nTuples = 0.0;
 
-	/* Result should be a TupleTableSlot, unless it's NULL. */
-	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+		if (node->result_ready && node->result != NULL &&
+			IsA(node->result, TupleTableSlot))
+			nTuples = TupIsNull((TupleTableSlot *) node->result) ? 0.0 : 1.0;
 
-	result = (TupleTableSlot *) node->result;
+		InstrStopNode(node->instrument, nTuples);
+	}
+}
 
-	if (node->instrument)
-		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
-	return result;
+/* ----------------------------------------------------------------
+ *		ExecExecuteNode
+ *
+ *		Request the next tuple from the given node.  Note that
+ *		if the node supports asynchrony, result_ready may not be
+ *		set on return (use ExecProcNode if you need that, or call
+ *		ExecAsyncWaitForNode).
+ * ----------------------------------------------------------------
+ */
+void
+ExecExecuteNode(PlanState *node)
+{
+	node->result_ready = false;
+	ExecDispatchNode(node);
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecProcNode
+ *
+ *		Get the next tuple from the given node.  If the node is
+ *		asynchronous, wait for a tuple to be ready before
+ *		returning.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecProcNode(PlanState *node)
+{
+	CHECK_FOR_INTERRUPTS();
+
+	/* mark any previous result as having been consumed */
+	node->result_ready = false;
+
+	if (node->chgParam != NULL) /* something changed */
+		ExecReScan(node);		/* let ReScan handle this */
+
+	ExecDispatchNode(node);
+
+	if (!node->result_ready)
+		ExecAsyncWaitForNode(node);
+
+	/* Result should be a TupleTableSlot, unless it's NULL. */
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+
+	return (TupleTableSlot *) node->result;
 }
 
 
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..38b37a1
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncWaitForNode(PlanState *planstate);
+extern void ExecAsyncNeedsWait(PlanState *planstate, int nevents,
+	bool reinit);
+extern void ExecAsyncDoesNotNeedWait(PlanState *planstate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 1eb09d8..7abc361 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -223,6 +223,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
  */
 extern PlanState *ExecInitNode(Plan *node, EState *estate, PlanState *parent,
 			 int eflags);
+extern void ExecDispatchNode(PlanState *node);
+extern void ExecExecuteNode(PlanState *node);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ff6c453..76e36a2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -382,6 +382,14 @@ typedef struct EState
 	ParamListInfo es_param_list_info;	/* values of external params */
 	ParamExecData *es_param_exec_vals;	/* values of internal params */
 
+	/* Asynchronous execution support */
+	struct PlanState **es_waiting_nodes;		/* array of waiting nodes */
+	int			es_num_waiting_nodes;	/* # of waiters in array */
+	int			es_max_waiting_nodes;	/* # of allocated entries */
+	int			es_total_async_events;	/* total of per-node n_async_events */
+	int			es_max_async_events;	/* # supported by event set */
+	struct WaitEventSet *es_wait_event_set;
+
 	/* Other working state: */
 	MemoryContext es_query_cxt; /* per-query context in which EState lives */
 
@@ -1034,6 +1042,8 @@ typedef struct PlanState
 	bool		result_ready;	/* true if result is ready */
 	Node	   *result;			/* result, most often TupleTableSlot */
 
+	int			n_async_events;	/* # of async events we want to register */
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
-- 
1.8.3.1
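
For illustration, the dispatch routine of a hypothetical async-capable node
might use the framework above as in the following sketch.  AsyncScanState,
async_data_available, and async_fetch_slot are invented names; the actual
event registration would happen in the node's ExecAsyncConfigureWait case.

void
ExecAsyncScan(AsyncScanState *node)
{
	if (async_data_available(node))
	{
		/* We can produce a tuple without blocking. */
		ExecAsyncDoesNotNeedWait(&node->ps);
		ExecReturnTuple(&node->ps, async_fetch_slot(node));
		return;
	}

	/*
	 * Not ready yet: ask the framework to wait on one event for us and
	 * return with result_ready still unset.  ExecAsyncWaitForNode will
	 * redispatch this node once the event fires.
	 */
	ExecAsyncNeedsWait(&node->ps, 1, false);
}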

0004-Fix-async-execution-framework.patch (text/x-patch; charset=us-ascii)
From 471d6c97cce9aa1a903dedaf3b1f76b9588e9bba Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 17:23:16 +0900
Subject: [PATCH 4/7] Fix async execution framework.

This commit moves the framework's bookkeeping into a per-subtree
AsyncContext, changes some behavior of the framework, and fixes some
minor bugs.
---
 src/backend/executor/execAsync.c    | 128 +++++++++++++++++++++++-------------
 src/backend/executor/execProcnode.c |  59 +++++++++++++++--
 src/backend/executor/execScan.c     |  33 +++++++---
 src/backend/executor/nodeSeqscan.c  |   7 +-
 src/include/executor/execAsync.h    |   7 ++
 src/include/executor/executor.h     |  10 +++
 src/include/nodes/execnodes.h       |  21 ++++--
 7 files changed, 196 insertions(+), 69 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 20601fa..51902e6 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -29,7 +29,7 @@
 
 #define	EVENT_BUFFER_SIZE		16
 
-static void ExecAsyncConfigureWait(PlanState *planstate, bool reinit);
+static bool ExecAsyncConfigureWait(PlanState *planstate, AsyncConfigMode mode);
 
 void
 ExecAsyncWaitForNode(PlanState *planstate)
@@ -37,13 +37,15 @@ ExecAsyncWaitForNode(PlanState *planstate)
 	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
 	PlanState  *callbacks[EVENT_BUFFER_SIZE];
 	int			ncallbacks = 0;
-	EState *estate = planstate->state;
+	EState     *estate = planstate->state;
+	AsyncContext *async_cxt = estate->es_async_cxt;
 
 	while (!planstate->result_ready)
 	{
-		bool	reinit = (estate->es_wait_event_set == NULL);
+		bool	reinit = (async_cxt->wait_event_set == NULL);
 		int		n;
 		int		noccurred;
+		bool	has_event = false;
 
 		if (reinit)
 		{
@@ -53,18 +55,39 @@ ExecAsyncWaitForNode(PlanState *planstate)
 			 * aggressive here, because plans that depend on massive numbers
 			 * of external FDs are likely to run afoul of kernel limits anyway.
 			 */
-			estate->es_max_async_events = estate->es_total_async_events + 16;
-			estate->es_wait_event_set =
-				CreateWaitEventSet(estate->es_query_cxt,
-								   estate->es_max_async_events);
+			async_cxt->max_events = async_cxt->total_events + 16;
+			async_cxt->wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt, async_cxt->max_events);
 		}
 
 		/* Give each waiting node a chance to add or modify events. */
-		for (n = 0; n < estate->es_num_waiting_nodes; ++n)
-			ExecAsyncConfigureWait(estate->es_waiting_nodes[n], reinit);
+		for (n = 0; n < async_cxt->num_waiting_nodes; ++n)
+			has_event |=
+				ExecAsyncConfigureWait(async_cxt->waiting_nodes[n],
+									   reinit ? ASYNCCONF_TRY_ADD : ASYNCCONF_MODIFY);
 
-		/* Wait for at least one event to occur. */
-		noccurred = WaitEventSetWait(estate->es_wait_event_set, -1,
+		if (!has_event)
+		{
+			/*
+			 * No events to wait on.  This occurs when every waiter shares
+			 * its synchronization object with nodes in another sync
+			 * subtree.  We must nevertheless have at least one event.
+			 */
+
+			for (n = 0; n < async_cxt->num_waiting_nodes; ++n)
+			{
+				if (ExecAsyncConfigureWait(async_cxt->waiting_nodes[n],
+										   ASYNCCONF_FORCE_ADD))
+					break;
+			}
+
+			/* Too bad.  We found nothing at all to wait on. */
+			if (n == async_cxt->num_waiting_nodes)
+				ereport(ERROR,
+						(errmsg("inconsistency in asynchronous execution")));
+		}
+
+		noccurred = WaitEventSetWait(async_cxt->wait_event_set, -1,
 									 occurred_event, EVENT_BUFFER_SIZE);
 		Assert(noccurred > 0);
 
@@ -115,9 +138,10 @@ ExecAsyncWaitForNode(PlanState *planstate)
 
 				/*
 				 * If there's now a tuple ready, we must dispatch to the
-				 * parent node; otherwise, there's nothing more to do.
+				 * parent node, up to but not past the node being waited
+				 * for; otherwise, there's nothing more to do.
 				 */
-				if (callbacks[i]->result_ready)
+				if (callbacks[i]->result_ready && callbacks[i] != planstate)
 					callbacks[i] = callbacks[i]->parent;
 				else
 					callbacks[i] = NULL;
@@ -143,54 +167,69 @@ ExecAsyncWaitForNode(PlanState *planstate)
 void
 ExecAsyncNeedsWait(PlanState *planstate, int nevents, bool reinit)
 {
-	EState *estate = planstate->state;
+	EState     *estate = planstate->state;
+	AsyncContext *async_cxt = estate->es_async_cxt;
 
 	Assert(nevents > 0); 	/* otherwise, use ExecAsyncDoesNotNeedWait */
 
 	/*
+	 * If no active async context is found, make a new one.
+	 */
+	if (async_cxt == NULL)
+	{
+		async_cxt = MemoryContextAlloc(estate->es_query_cxt,
+									   sizeof(AsyncContext));
+		memset(async_cxt, 0, sizeof(AsyncContext));
+
+		planstate->state->es_async_cxt = async_cxt;
+	}
+
+	/*
 	 * If this node is not already present in the array of waiting nodes,
 	 * then add it.  If that array hasn't been allocated or is full, this may
 	 * require (re)allocating it.
 	 */
 	if (planstate->n_async_events == 0)
 	{
-		if (estate->es_max_waiting_nodes >= estate->es_num_waiting_nodes)
+		if (async_cxt->max_waiting_nodes <= async_cxt->num_waiting_nodes)
 		{
 			int		newmax;
 
-			if (estate->es_max_waiting_nodes == 0)
+			if (async_cxt->max_waiting_nodes == 0)
 			{
 				newmax = 16;
-				estate->es_waiting_nodes =
-					MemoryContextAlloc(estate->es_query_cxt, newmax);
+				async_cxt->waiting_nodes =
+					MemoryContextAlloc(estate->es_query_cxt,
+									   newmax * sizeof(PlanState *));
 			}
 			else
 			{
-				newmax = estate->es_max_waiting_nodes * 2;
-				estate->es_waiting_nodes =
-					repalloc(estate->es_waiting_nodes,
+				newmax = async_cxt->max_waiting_nodes * 2;
+				async_cxt->waiting_nodes =
+					repalloc(async_cxt->waiting_nodes,
 							 newmax * sizeof(PlanState *));
 			}
-			estate->es_max_waiting_nodes = newmax;
+			async_cxt->max_waiting_nodes = newmax;
 		}
-		estate->es_waiting_nodes[estate->es_num_waiting_nodes++] = planstate;
+		async_cxt->waiting_nodes[async_cxt->num_waiting_nodes++] =
+			planstate;
 	}
 
-	/* Adjust per-node and per-estate totals. */
-	estate->es_total_async_events -= planstate->n_async_events;
+	/* Adjust per-node and per-async-context totals. */
+	async_cxt->total_events -= planstate->n_async_events;
 	planstate->n_async_events = nevents;
-	estate->es_total_async_events += planstate->n_async_events;
+	async_cxt->total_events += planstate->n_async_events;
 
 	/*
 	 * If a WaitEventSet has already been created, we need to discard it and
 	 * start again if the user passed reinit = true, or if the total number of
 	 * required events exceeds the supported number.
 	 */
-	if (estate->es_wait_event_set != NULL && (reinit ||
-		estate->es_total_async_events > estate->es_max_async_events))
+	if (async_cxt->wait_event_set != NULL && (reinit ||
+		async_cxt->total_events > async_cxt->max_events))
 	{
-		FreeWaitEventSet(estate->es_wait_event_set);
-		estate->es_wait_event_set = NULL;
+		FreeWaitEventSet(async_cxt->wait_event_set);
+		async_cxt->wait_event_set = NULL;
 	}
 }
 
@@ -202,7 +241,9 @@ void
 ExecAsyncDoesNotNeedWait(PlanState *planstate)
 {
 	int		n;
-	EState *estate = planstate->state;
+	AsyncContext *async_cxt = planstate->state->es_async_cxt;
+
+	Assert(async_cxt);
 
 	if (planstate->n_async_events <= 0)
 		return;
@@ -211,21 +252,20 @@ ExecAsyncDoesNotNeedWait(PlanState *planstate)
 	 * Remove the node from the list of waiting nodes.  (Is a linear search
 	 * going to be a problem here?  I think probably not.)
 	 */
-	for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+	for (n = 0; n < async_cxt->num_waiting_nodes; ++n)
 	{
-		if (estate->es_waiting_nodes[n] == planstate)
-		{
-			estate->es_waiting_nodes[n] =
-				estate->es_waiting_nodes[--estate->es_num_waiting_nodes];
+		if (async_cxt->waiting_nodes[n] == planstate)
 			break;
-		}
 	}
 
 	/* We should always find ourselves in the array. */
-	Assert(n < estate->es_num_waiting_nodes);
+	Assert(n < async_cxt->num_waiting_nodes);
+
+	async_cxt->waiting_nodes[n] =
+		async_cxt->waiting_nodes[--async_cxt->num_waiting_nodes];
 
 	/* We no longer need any asynchronous events. */
-	estate->es_total_async_events -= planstate->n_async_events;
+	async_cxt->total_events -= planstate->n_async_events;
 	planstate->n_async_events = 0;
 
 	/*
@@ -234,18 +274,18 @@ ExecAsyncDoesNotNeedWait(PlanState *planstate)
 	 * assumes we actually did register some events at one point, because we
 	 * needed to wait at some point and we don't any more.
 	 */
-	if (estate->es_wait_event_set != NULL)
+	if (async_cxt->wait_event_set != NULL)
 	{
-		FreeWaitEventSet(estate->es_wait_event_set);
-		estate->es_wait_event_set = NULL;
+		FreeWaitEventSet(async_cxt->wait_event_set);
+		async_cxt->wait_event_set = NULL;
 	}
 }
 
 /*
  * Give per-nodetype function a chance to register wait events.
  */
-static void
-ExecAsyncConfigureWait(PlanState *planstate, bool reinit)
+static bool
+ExecAsyncConfigureWait(PlanState *planstate, AsyncConfigMode config_mode)
 {
 	switch (nodeTag(planstate))
 	{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index b7ac08e..4f468c1 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -377,6 +377,9 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 void
 ExecDispatchNode(PlanState *node)
 {
+	if (node->result_ready)
+		return;
+
 	if (node->instrument)
 		InstrStartNode(node->instrument);
 
@@ -559,6 +562,8 @@ void
 ExecExecuteNode(PlanState *node)
 {
 	node->result_ready = false;
+	if (node->chgParam != NULL) /* something changed */
+		ExecReScan(node);		/* let ReScan handle this */
 	ExecDispatchNode(node);
 }
 
@@ -569,28 +574,44 @@ ExecExecuteNode(PlanState *node)
  *		Get the next tuple from the given node.  If the node is
  *		asynchronous, wait for a tuple to be ready before
  *		returning.
- * ----------------------------------------------------------------
+ *		The given node serves as the root of an asynchronous execution
+ *		subtree, and each such subtree gets its own async context.
+ * ----------------------------------------------------------------
  */
 TupleTableSlot *
 ExecProcNode(PlanState *node)
 {
 	CHECK_FOR_INTERRUPTS();
 
-	/* mark any previous result as having been consumed */
-	node->result_ready = false;
+	/* Return unconsumed result if any */
+	if (node->result_ready)
+		return ExecConsumeResult(node);
 
 	if (node->chgParam != NULL) /* something changed */
 		ExecReScan(node);		/* let ReScan handle this */
 
+	/*
+	 * Stash the currently active async context in the parent, if any,
+	 * then activate the context of the given node.
+	 */
+	if (node->parent)
+		node->parent->save_async_cxt = node->state->es_async_cxt;
+	node->state->es_async_cxt = node->save_async_cxt;
+
 	ExecDispatchNode(node);
 
 	if (!node->result_ready)
 		ExecAsyncWaitForNode(node);
 
-	/* Result should be a TupleTableSlot, unless it's NULL. */
-	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+	/*
+	 * Save the active async context into the given node and restore the
+	 * parent's.  The saved context may still contain running nodes.
+	 */
+	node->save_async_cxt = node->state->es_async_cxt;
+	if (node->parent)
+		node->state->es_async_cxt = node->parent->save_async_cxt;
 
-	return (TupleTableSlot *) node->result;
+	return ExecConsumeResult(node);
 }
 
 
@@ -848,9 +869,22 @@ ExecEndNode(PlanState *node)
 bool
 ExecShutdownNode(PlanState *node)
 {
+	bool ret;
+
 	if (node == NULL)
 		return false;
 
+	/*
+	 * Maintain the active async context on the executor state.  Unlike
+	 * ExecProcNode, this should be done only when a saved context exists.
+	 */
+	if (node->save_async_cxt)
+	{
+		if (node->parent)
+			node->parent->save_async_cxt = node->state->es_async_cxt;
+		node->state->es_async_cxt = node->save_async_cxt;
+	}
+
 	switch (nodeTag(node))
 	{
 		case T_GatherState:
@@ -860,5 +894,16 @@ ExecShutdownNode(PlanState *node)
 			break;
 	}
 
-	return planstate_tree_walker(node, ExecShutdownNode, NULL);
+	ret = planstate_tree_walker(node, ExecShutdownNode, NULL);
+
+	/*
+	 * Restore the async context of the upper subtree only if exists.
+	 */
+	if (node->parent && node->parent->save_async_cxt)
+	{
+		node->save_async_cxt = node->state->es_async_cxt;
+		node->state->es_async_cxt = node->parent->save_async_cxt;
+	}
+
+	return ret;
 }
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 095d40b..69d616b 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -128,6 +128,9 @@ ExecScan(ScanState *node,
 	ExprDoneCond isDone;
 	TupleTableSlot *resultSlot;
 
+	if (node->ps.result_ready)
+		return;
+
 	/*
 	 * Fetch data from node
 	 */
@@ -136,14 +139,25 @@ ExecScan(ScanState *node,
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * The underlying nodes don't use ExecReturnTuple.  Set this flag here
+	 * so that async-unaware children don't need to touch it explicitly;
+	 * async-capable nodes will clear it themselves when necessary.
+	 */
+	node->ps.result_ready = true;
+
+	/*
 	 * If we have neither a qual to check nor a projection to do, just skip
 	 * all the overhead and produce the raw scan tuple.
 	 */
 	if (!qual && !projInfo)
 	{
+		TupleTableSlot *slot;
+
 		ResetExprContext(econtext);
-		ExecReturnTuple(&node->ps,
-						ExecScanFetch(node, accessMtd, recheckMtd));
+		slot = ExecScanFetch(node, accessMtd, recheckMtd);
+		if (node->ps.result_ready)
+			node->ps.result = (Node *) slot;
+
 		return;
 	}
 
@@ -158,7 +172,7 @@ ExecScan(ScanState *node,
 		resultSlot = ExecProject(projInfo, &isDone);
 		if (isDone == ExprMultipleResult)
 		{
-			ExecReturnTuple(&node->ps, resultSlot);
+			node->ps.result = (Node *) resultSlot;
 			return;
 		}
 		/* Done with that source tuple... */
@@ -184,6 +198,9 @@ ExecScan(ScanState *node,
 
 		slot = ExecScanFetch(node, accessMtd, recheckMtd);
 
+		if (!node->ps.result_ready)
+			return;
+
 		/*
 		 * if the slot returned by the accessMtd contains NULL, then it means
 		 * there is nothing more to scan so we just return an empty slot,
@@ -193,9 +210,9 @@ ExecScan(ScanState *node,
 		if (TupIsNull(slot))
 		{
 			if (projInfo)
-				ExecReturnTuple(&node->ps, ExecClearTuple(projInfo->pi_slot));
-			else
-				ExecReturnTuple(&node->ps, slot);
+				slot = ExecClearTuple(projInfo->pi_slot);
+
+			node->ps.result = (Node *) slot;
 			return;
 		}
 
@@ -227,7 +244,7 @@ ExecScan(ScanState *node,
 				if (isDone != ExprEndResult)
 				{
 					node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-					ExecReturnTuple(&node->ps, resultSlot);
+					node->ps.result = (Node *) resultSlot;
 					return;
 				}
 			}
@@ -236,7 +253,7 @@ ExecScan(ScanState *node,
 				/*
 				 * Here, we aren't projecting, so just return scan tuple.
 				 */
-				ExecReturnTuple(&node->ps, slot);
+				node->ps.result = (Node *) slot;
 				return;
 			}
 		}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 0ca86d9..ef1ce9c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -124,9 +124,10 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
 void
 ExecSeqScan(SeqScanState *node)
 {
-	return ExecScan((ScanState *) node,
-					(ExecScanAccessMtd) SeqNext,
-					(ExecScanRecheckMtd) SeqRecheck);
+	ExecScan((ScanState *) node,
+			 (ExecScanAccessMtd) SeqNext,
+			 (ExecScanRecheckMtd) SeqRecheck);
+
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index 38b37a1..f1c748b 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -15,6 +15,13 @@
 
 #include "nodes/execnodes.h"
 
+typedef enum AsyncConfigMode
+{
+	ASYNCCONF_MODIFY,
+	ASYNCCONF_TRY_ADD,
+	ASYNCCONF_FORCE_ADD
+} AsyncConfigMode;
+
 extern void ExecAsyncWaitForNode(PlanState *planstate);
 extern void ExecAsyncNeedsWait(PlanState *planstate, int nevents,
 	bool reinit);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7abc361..ad19486 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -239,6 +239,16 @@ ExecReturnTuple(PlanState *node, TupleTableSlot *slot)
 	node->result_ready = true;
 }
 
+/* Convenience function to retrieve a node's result. */
+static inline TupleTableSlot *
+ExecConsumeResult(PlanState *node)
+{
+	Assert(node->result_ready);
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+	node->result_ready = false;
+	return (TupleTableSlot *) node->result;
+}
+
 /*
  * prototypes from functions in execQual.c
  */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 76e36a2..9121537 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -345,6 +345,17 @@ typedef struct ResultRelInfo
 	List	   *ri_onConflictSetWhere;
 } ResultRelInfo;
 
+/* Asynchronous execution support */
+typedef struct AsyncContext
+{
+	struct PlanState **waiting_nodes;	/* array of waiting nodes */
+	int			num_waiting_nodes;		/* # of waiters in array */
+	int			max_waiting_nodes;		/* # of allocated entries */
+	int			total_events;			/* total of per-node n_async_events */
+	int			max_events;				/* # supported by event set */
+	struct WaitEventSet *wait_event_set;
+} AsyncContext;
+
 /* ----------------
  *	  EState information
  *
@@ -382,13 +393,7 @@ typedef struct EState
 	ParamListInfo es_param_list_info;	/* values of external params */
 	ParamExecData *es_param_exec_vals;	/* values of internal params */
 
-	/* Asynchronous execution support */
-	struct PlanState **es_waiting_nodes;		/* array of waiting nodes */
-	int			es_num_waiting_nodes;	/* # of waiters in array */
-	int			es_max_waiting_nodes;	/* # of allocated entries */
-	int			es_total_async_events;	/* total of per-node n_async_events */
-	int			es_max_async_events;	/* # supported by event set */
-	struct WaitEventSet *es_wait_event_set;
+	AsyncContext  *es_async_cxt;		/* Async context currently active */
 
 	/* Other working state: */
 	MemoryContext es_query_cxt; /* per-query context in which EState lives */
@@ -1060,6 +1065,8 @@ typedef struct PlanState
 								 * subselects) */
 	List	   *subPlan;		/* SubPlanState nodes in my expressions */
 
+	AsyncContext *save_async_cxt; /* Stash for async context */
+
 	/*
 	 * State for management of parameter-change-driven rescanning
 	 */
-- 
1.8.3.1
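
On the consuming side, a parent node could drive several async-capable
children with the entry points above roughly as follows (illustrative and
untested; a real implementation would want to return whichever child
finishes first rather than blocking on a fixed one).

static TupleTableSlot *
fetch_from_any_child(PlanState **children, int nchildren)
{
	int			i;

	/* Kick every child; async-capable ones may return with no result. */
	for (i = 0; i < nchildren; i++)
	{
		ExecExecuteNode(children[i]);
		if (children[i]->result_ready)
			return ExecConsumeResult(children[i]);
	}

	/* Nobody was ready; block until the first child produces a tuple. */
	ExecAsyncWaitForNode(children[0]);
	return ExecConsumeResult(children[0]);
}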

0005-Add-new-fdwroutine-AsyncConfigureWait-and-ShutdownFo.patch (text/x-patch; charset=us-ascii)
From dad07730989ae978739cb29a747a6642886887b0 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 17:25:30 +0900
Subject: [PATCH 5/7] Add new fdwroutine AsyncConfigureWait and
 ShutdownForeignScan.

Async-capable nodes must handle the AsyncConfigureWait and
ExecShutdownNode callbacks. This patch adds ForeignScan cases to those
two functions and adds the corresponding FdwRoutine entries.
---
 src/backend/executor/execAsync.c    | 14 ++++++++++++--
 src/backend/executor/execProcnode.c |  9 +++++++++
 src/include/foreign/fdwapi.h        |  8 ++++++++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 51902e6..00de11b 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -25,6 +25,7 @@
 
 #include "executor/execAsync.h"
 #include "executor/executor.h"
+#include "foreign/fdwapi.h"
 #include "storage/latch.h"
 
 #define	EVENT_BUFFER_SIZE		16
@@ -289,8 +290,17 @@ ExecAsyncConfigureWait(PlanState *planstate, AsyncConfigMode config_mode)
 {
 	switch (nodeTag(planstate))
 	{
-		/* XXX: Add calls to per-nodetype handlers here. */
-		default:
+		/* Add calls to per-nodetype handlers here. */
+		case T_ForeignScanState:
+			{
+				ForeignScanState *node = (ForeignScanState *) planstate;
+				if (node->fdwroutine->AsyncConfigureWait)
+					return node->fdwroutine->AsyncConfigureWait(node, config_mode);
+			}
+			break;
+		default:
 			elog(ERROR, "unexpected node type: %d", nodeTag(planstate));
 	}
+
+	return false;
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 4f468c1..0e6ed39 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -115,6 +115,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -890,6 +891,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+			{
+				ForeignScanState *fsstate = (ForeignScanState *) node;
+				FdwRoutine *fdwroutine = fsstate->fdwroutine;
+				if (fdwroutine->ShutdownForeignScan)
+					fdwroutine->ShutdownForeignScan(fsstate);
+			}
+			break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..8de44dd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "executor/execAsync.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
 
@@ -154,6 +155,9 @@ typedef void (*InitializeWorkerForeignScan_function) (ForeignScanState *node,
 typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
+typedef bool (*AsyncConfigureWait_function) (ForeignScanState *node,
+											 AsyncConfigMode config_mode);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -224,6 +228,10 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	AsyncConfigureWait_function AsyncConfigureWait;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
1.8.3.1
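
An FDW's AsyncConfigureWait callback can then be quite small.  The sketch
below is hypothetical (MyFdwScanState, query_in_flight, and conn are
invented names; postgres_fdw's real version follows in the next patch): it
registers the connection's socket in the active wait-event set, with
user_data pointing back at the node, as the framework expects.

static bool
myfdwAsyncConfigureWait(ForeignScanState *node, AsyncConfigMode mode)
{
	MyFdwScanState *fsstate = (MyFdwScanState *) node->fdw_state;
	EState	   *estate = node->ss.ps.state;

	/* Nothing outstanding on the wire, so nothing to wait for. */
	if (!fsstate->query_in_flight)
		return false;

	/* Under ASYNCCONF_MODIFY the event is already in the set; keep it. */
	if (mode != ASYNCCONF_MODIFY)
		AddWaitEventToSet(estate->es_async_cxt->wait_event_set,
						  WL_SOCKET_READABLE, PQsocket(fsstate->conn),
						  NULL, node);
	return true;
}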

0006-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 6621421daece8718c0d6c31985ec77eea145f435 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 16:15:23 +0900
Subject: [PATCH 6/7] Make postgres_fdw async-capable

It sends the next FETCH just after the previous result is received and
returns to the caller with result_ready unset. This reduces the time
spent waiting for the result of each fetch command. Multiple nodes on
the same connection are properly arbitrated.
---
 contrib/postgres_fdw/connection.c              |  81 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  34 +-
 contrib/postgres_fdw/postgres_fdw.c            | 513 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   4 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 5 files changed, 523 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 8ca1c1c..0665d54 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -48,6 +48,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -63,6 +64,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -74,31 +76,17 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
-
+	
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
 	{
@@ -121,11 +109,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -138,8 +123,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+	
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -176,6 +192,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Return the connection-specific storage for this user, allocating it
+ * (zeroed) with the given initsize if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 107f0b7..eb71fd3 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5547,27 +5547,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 93ebd8c..b99bd1b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -50,6 +51,8 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -121,10 +124,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -135,7 +155,6 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -151,6 +170,12 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+									 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -164,11 +189,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -191,6 +216,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -201,7 +227,6 @@ typedef struct PgFdwDirectModifyState
 	bool		set_processed;	/* do we set the command es_processed? */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the update */
 	int			numParams;		/* number of parameters passed to query */
 	FmgrInfo   *param_flinfo;	/* output conversion functions for them */
 	List	   *param_exprs;	/* executable expressions for param values */
@@ -221,6 +246,7 @@ typedef struct PgFdwDirectModifyState
  */
 typedef struct PgFdwAnalyzeState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by query */
@@ -289,6 +315,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -345,6 +372,8 @@ static void postgresGetForeignJoinPaths(PlannerInfo *root,
 							JoinPathExtraData *extra);
 static bool postgresRecheckForeignScan(ForeignScanState *node,
 						   TupleTableSlot *slot);
+static bool postgresAsyncConfigureWait(ForeignScanState *node,
+									   AsyncConfigMode mode);
 
 /*
  * Helper functions
@@ -365,7 +394,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -426,6 +458,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -457,6 +490,9 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for asynchronous execution */
+	routine->AsyncConfigureWait = postgresAsyncConfigureWait;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1313,12 +1349,20 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1372,27 +1416,122 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+	
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0);
+				if (!(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					node->ss.ps.result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}					
+
+			Assert(fsstate->async_waiting);
+
+			ExecAsyncDoesNotNeedWait((PlanState *) node);
+			fsstate->async_waiting = false;
+
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting for this node on the same connection, let
+			 * the first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Someone else owns this connection. Add myself to the tail of
+			 * the waiters' list, then return not-ready.  To avoid scanning
+			 * through the waiters' list, the current owner maintains a
+			 * shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			ExecAsyncNeedsWait((PlanState *) node, 1, false);
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			node->ss.ps.result_ready = false;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+			{
+				ExecAsyncNeedsWait((PlanState *) next_conn_owner, 1, false);
+				next_owner_state->async_waiting = true;
+			}
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			node->ss.ps.result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
@@ -1406,6 +1545,73 @@ postgresIterateForeignScan(ForeignScanState *node)
 	return slot;
 }
 
+
+static bool
+postgresAsyncConfigureWait(ForeignScanState *node, AsyncConfigMode mode)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	AsyncContext *async_cxt = node->ss.ps.state->es_async_cxt;
+
+	if ((mode == ASYNCCONF_TRY_ADD || mode == ASYNCCONF_FORCE_ADD) &&
+		fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(async_cxt->wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, node);
+		return true;
+	}
+
+	if (mode == ASYNCCONF_FORCE_ADD && fsstate->s.connspec->current_owner)
+	{
+		/*
+		 * We need to set a wait event somehow. This occurs when the
+		 * connection owner does not reside in the current waiters' list. In
+		 * that case, forcibly make the connection owner finish its current
+		 * request and usurp the connection.
+		 */
+		ForeignScanState *owner = fsstate->s.connspec->current_owner;
+		PgFdwScanState *owner_state = GetPgFdwScanState(owner);
+		ForeignScanState *prev_waiter, *node_tmp;
+
+		fetch_received_data(owner);
+
+		/* find myself in the waiters' list */
+		prev_waiter = owner;
+
+		while (GetPgFdwScanState(prev_waiter)->waiter != node)
+			prev_waiter = GetPgFdwScanState(prev_waiter)->waiter;
+
+		/* Swap the previous owner and this node */
+
+		if (owner_state->waiter == node)
+			node_tmp = owner;
+		else
+		{
+			node_tmp = owner_state->waiter;
+			GetPgFdwScanState(prev_waiter)->waiter = owner;
+		}
+
+		owner_state->waiter = fsstate->waiter;
+		fsstate->waiter = node_tmp;
+
+		if (owner_state->last_waiter == node)
+			fsstate->last_waiter = prev_waiter;
+		else
+			fsstate->last_waiter = owner_state->last_waiter;
+		
+		request_more_data(node);
+		
+		/* now I am the connection owner */
+		AddWaitEventToSet(async_cxt->wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, node);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * postgresReScanForeignScan
  *		Restart the scan.
@@ -1413,7 +1619,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1421,6 +1627,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1449,9 +1658,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1469,7 +1678,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1477,16 +1686,41 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	if (fsstate->async_waiting)
+	{
+		Assert(node->ss.ps.state->es_async_cxt);
+
+		ExecAsyncDoesNotNeedWait((PlanState *) node);
+		fsstate->async_waiting = false;
+	}
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1688,7 +1922,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1769,6 +2005,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1779,14 +2017,15 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1794,10 +2033,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1835,6 +2074,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1855,14 +2096,15 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1870,10 +2112,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1911,6 +2153,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1931,14 +2175,15 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1946,10 +2191,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1996,16 +2241,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2285,7 +2530,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2340,7 +2587,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2387,8 +2637,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2506,6 +2756,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2548,6 +2799,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+		
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2825,11 +3086,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2895,47 +3156,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Read the result of the FETCH previously issued by request_more_data.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while (fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -2945,26 +3255,81 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
+/* 
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	Assert(!fsstate->async_waiting);
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while (PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connspec->current_owner = NULL;
+	}
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3049,7 +3414,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3059,12 +3424,13 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3072,9 +3438,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3205,9 +3571,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn,
+						   false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3215,10 +3582,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4411,7 +4778,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 67126bc..b0c1266 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,7 +79,8 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
-
+	bool		allow_prefetch;	/* true to allow overlapped fetching */
+
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
 	 * relations but is set for all relations. For join relation, the name
@@ -100,6 +101,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 58c55a4..fd0ab25 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1252,8 +1252,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
1.8.3.1
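
A side note on the split above: request_more_data() now issues the
FETCH with PQsendQuery() and returns, and fetch_received_data()
collects the rows once the socket is readable. Reduced to standalone
libpq, with the executor plumbing and error handling stripped away,
the pattern is roughly the following sketch (async_fetch is a made-up
name for illustration; a real executor node would hand control back to
other plan nodes instead of looping in select()):

#include <sys/select.h>
#include <libpq-fe.h>

/*
 * Sketch of the send/receive split: issue a FETCH without blocking,
 * wait for the socket, then collect the result.
 */
static PGresult *
async_fetch(PGconn *conn, const char *sql)
{
	if (!PQsendQuery(conn, sql))		/* the request_more_data() side */
		return NULL;

	while (PQisBusy(conn))				/* the fetch_received_data() side */
	{
		fd_set		rfds;
		int			sock = PQsocket(conn);

		FD_ZERO(&rfds);
		FD_SET(sock, &rfds);
		if (select(sock + 1, &rfds, NULL, NULL, NULL) < 0)
			return NULL;
		if (!PQconsumeInput(conn))		/* absorb whatever has arrived */
			return NULL;
	}

	/* Will not block now; the caller must still drain PQgetResult(). */
	return PQgetResult(conn);
}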

0007-Make-Append-node-async-aware.patch (text/x-patch)
From 2cb9c0f5f9c453d8b07c31625f99f25d8b73fe14 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 18:52:37 +0900
Subject: [PATCH 7/7] Make Append node async-aware.

Change the Append node to be capable of handling asynchronous
children properly. As soon as it receives !async_ready from a child,
it moves on to the next child, and if no child is ready, it sleeps
until at least one of them becomes ready.
---
 src/backend/executor/nodeAppend.c | 94 ++++++++++++++++++++++++++++++++-------
 src/include/nodes/execnodes.h     |  2 +
 2 files changed, 80 insertions(+), 16 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index e0ce8c6..1c0d26e 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -121,6 +122,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 {
 	AppendState *appendstate = makeNode(AppendState);
 	PlanState **appendplanstates;
+	bool	   *finished;
 	int			nplans;
 	int			i;
 	ListCell   *lc;
@@ -134,14 +136,17 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	nplans = list_length(node->appendplans);
 
 	appendplanstates = (PlanState **) palloc0(nplans * sizeof(PlanState *));
-
+	finished = (bool *) palloc0(nplans * sizeof(bool));
+	
 	/*
 	 * create new AppendState for our append node
 	 */
 	appendstate->ps.plan = (Plan *) node;
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
+	appendstate->as_finished = finished;
 	appendstate->as_nplans = nplans;
+	appendstate->as_async = ((eflags & EXEC_FLAG_BACKWARD) == 0);
 
 	/*
 	 * Miscellaneous initialization
@@ -194,49 +199,104 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 void
 ExecAppend(AppendState *node)
 {
-	for (;;)
+	int n_notready = 0;
+	PlanState  *subnode;
+	int i, n;
+
+	n = node->as_whichplan;
+
+	for (i = 0 ; i < node->as_nplans ; i++)
 	{
-		PlanState  *subnode;
 		TupleTableSlot *result;
 
+		if (node->as_async)
+		{
+			if (n >= node->as_nplans)
+				n = 0;
+
+			if (node->as_finished[n])
+			{
+				n++;
+				continue;
+			}
+		}
+
 		/*
 		 * figure out which subplan we are currently processing
 		 */
-		subnode = node->appendplans[node->as_whichplan];
+		subnode = node->appendplans[n];
 
 		/*
-		 * get a tuple from the subplan
+		 * execute the subplan to get a result if it is not ready yet
 		 */
-		result = ExecProcNode(subnode);
+		if (!subnode->result_ready)
+			ExecExecuteNode(subnode);
+
+		/*
+		 * If this subnode is not ready yet and asynchrony is not allowed,
+		 * immediately wait for this subnode.
+		 */
+		if (!subnode->result_ready)
+		{
+			if (node->as_async)
+			{
+				n_notready++;
+				n++;
+				continue;
+			}
+			ExecAsyncWaitForNode(subnode);
+		}
+
+		Assert(subnode->result_ready);
+
+		result = ExecConsumeResult((PlanState *)subnode);
 
 		if (!TupIsNull(result))
 		{
 			/*
-			 * If the subplan gave us something then return it as-is. We do
-			 * NOT make use of the result slot that was set up in
-			 * ExecInitAppend; there's no need for it.
+			 * If the subplan gave us something then return it
+			 * as-is. We do NOT make use of the result slot that was
+			 * set up in ExecInitAppend; there's no need for it.
 			 */
+			node->as_whichplan = n;
 			ExecReturnTuple(&node->ps, result);
 			return;
 		}
 
+		/* Tuples have been exhausted on this subnode */
+		if (node->as_async)
+		{
+			node->as_finished[n++] = true;
+			continue;
+		}
+
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
 		 * ExecInitAppend.
 		 */
 		if (ScanDirectionIsForward(node->ps.state->es_direction))
-			node->as_whichplan++;
+		{
+			if (n >= node->as_nplans - 1)
+				break;
+			n++;
+		}
 		else
-			node->as_whichplan--;
-		if (!exec_append_initialize_next(node))
 		{
-			ExecReturnTuple(&node->ps,
-							ExecClearTuple(node->ps.ps_ResultTupleSlot));
-			return;
+			if (n == 0)
+				break;
+			n--;
 		}
+	}
 
-		/* Else loop back and try to get a tuple from the new subplan */
+	/*
+	 * We are finished if we reach here and no subnode is still pending.
+	 */
+	if (n_notready == 0)
+	{
+		node->as_whichplan = n;
+		ExecReturnTuple(&node->ps,
+						ExecClearTuple(node->ps.ps_ResultTupleSlot));
 	}
 }
 
@@ -277,6 +337,8 @@ ExecReScanAppend(AppendState *node)
 	{
 		PlanState  *subnode = node->appendplans[i];
 
+		node->as_finished[i] = false;
+
 		/*
 		 * ExecReScan doesn't know about my subplans, so I have to do
 		 * changed-parameter signaling myself.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9121537..e4f2bb6 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1170,6 +1170,8 @@ typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
+	bool		as_async;		/* true to allow async execution */
+	bool	   *as_finished;	/* true if the corresponding subplan is done */
 	int			as_nplans;
 	int			as_whichplan;
 } AppendState;
-- 
1.8.3.1
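
Stripped of the executor plumbing, the scheduling policy this patch
gives ExecAppend is: cycle over the children starting from the last
position, skip the finished ones, take a result from the first child
that has one, and count the live children that were not ready; those
are what the node then sleeps on. A minimal sketch of that control
flow, with a made-up Child type standing in for PlanState:

#include <stdbool.h>

typedef struct Child
{
	bool		finished;		/* no more tuples will ever come */
	bool		ready;			/* a result is available right now */
} Child;

/*
 * Return the index of a child that has a result ready, scanning from
 * "start" and wrapping around, or -1 if none does.  *n_notready counts
 * the live children that were skipped.
 */
static int
pick_ready_child(Child *children, int nchildren, int start, int *n_notready)
{
	*n_notready = 0;
	for (int i = 0; i < nchildren; i++)
	{
		int			n = (start + i) % nchildren;

		if (children[n].finished)
			continue;
		if (!children[n].ready)
		{
			(*n_notready)++;	/* still running; do not block here */
			continue;
		}
		return n;
	}
	return -1;
}

When the function comes back with -1 and *n_notready == 0, every child
is exhausted and the Append returns its empty slot; with *n_notready >
0, the real node waits on its event set instead of busy-looping.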

gentbl.sql (text/plain)
testrun.sh (text/plain)
#44Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro HORIGUCHI (#43)
Re: asynchronous and vectorized execution

On Wed, Jul 6, 2016 at 3:29 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

This seems to be a good opportunity to show this patch. The
attached patch set does async execution of foreign scans
(postgres_fdw) on Robert's first infrastructure, with some
modifications.

Cool.

ExecAsyncWaitForNode can get into an infinite wait through recursive
calls of ExecAsyncWaitForNode caused by ExecProcNode being called from
async-unaware nodes. Such recursive calls cause a wait on
already-ready nodes.

Hmm, that's annoying.

I solved that in the patch set by allocating a separate
async-execution context for every async-execution subtree made by
ExecProcNode, instead of one async-exec context for the whole
execution tree. This works fine, but the way the contexts are
switched seems ugly. It might also be solved by making
ExecAsyncWaitForNode return when there is no node to wait for, even
if the waiting node is not ready. That might keep the async-exec
context (state) simpler, so I'll try that.

I think you should instead try to make ExecAsyncWaitForNode properly reentrant.

Does ParallelWorkerSetLatchesForGroup use a mutex or semaphore or
something similar instead of latches?

Why would it do that?

BTW, we also need to benchmark those changes to add the parent
pointers and change the return conventions and see if they have any
measurable impact on performance.

I have to bring you some bad news.

With the attached patch, an append over four foreign scans on one
server (local) performs about 10% faster, and about twice as fast for
three or four foreign scans on separate foreign servers (connections)
respectively, but only when compiled with -O0. I found that it takes
hopelessly little advantage of compiler optimization, while the
unpatched version gets faster.

Two things:

1. That's not the scenario I'm talking about. I'm concerned about
making sure that query plans that don't use asynchronous execution
don't get slower.

2. I have to believe that's a defect in your implementation rather
than something intrinsic, or maybe your test scenario is bad. It's
very hard - really impossible - to believe that all queries involving
FDW pushdown are locally CPU-bound.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#45Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Robert Haas (#44)
Re: asynchronous and vectorized execution

Hello,

At Thu, 7 Jul 2016 13:59:54 -0400, Robert Haas <robertmhaas@gmail.com> wrote in <CA+TgmobD9uM9=zVz+jvTyEM_o9rwDP3RBJkJPzb0HCpR9-085A@mail.gmail.com>

On Wed, Jul 6, 2016 at 3:29 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

This seems to be a good opportunity to show this patch. The
attached patch set does async execution of foreign scans
(postgres_fdw) on Robert's first infrastructure, with some
modifications.

Cool.

Thank you.

ExecAsyncWaitForNode can get into an infinite wait through recursive
calls of ExecAsyncWaitForNode caused by ExecProcNode being called from
async-unaware nodes. Such recursive calls cause a wait on
already-ready nodes.

Hmm, that's annoying.

I solved that in the patch set by allocating a separate
async-execution context for every async-execution subtree made by
ExecProcNode, instead of one async-exec context for the whole
execution tree. This works fine, but the way the contexts are
switched seems ugly. It might also be solved by making
ExecAsyncWaitForNode return when there is no node to wait for, even
if the waiting node is not ready. That might keep the async-exec
context (state) simpler, so I'll try that.

I think you should instead try to make ExecAsyncWaitForNode properly reentrant.

I feel the same way. Will try to do that.
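
A minimal sketch of the direction I have in mind, with a made-up
wait_one_event() helper standing in for the WaitEventSet machinery, so
this is an idea sketch rather than the patch's code:

#include "postgres.h"
#include "nodes/execnodes.h"

extern PlanState *wait_one_event(void);		/* assumed helper */

/*
 * Idea sketch: block only until "target" has a result.  Any other node
 * whose event fires first is marked ready rather than being waited on
 * again later; the real code would let that node consume its input at
 * this point.  wait_one_event() is assumed to block until some
 * registered node's socket fires and to return that node.
 */
static void
wait_for_node_reentrant(PlanState *target)
{
	while (!target->result_ready)
	{
		PlanState  *fired = wait_one_event();

		fired->result_ready = true;
		if (fired == target)
			break;			/* return instead of draining every waiter */
	}
}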

Does ParallelWorkerSetLatchesForGroup use a mutex or semaphore or
something similar instead of latches?

Why would it do that?

I may have misunderstood the original sentence, but the reason for my
question is that I didn't see the connection between "When an
executor node does something that might unblock other workers, it
calls ParallelWorkerSetLatchesForGroup()" and "and the async stuff
then tries calling all of the nodes in this array". I supposed that
the former takes place on each worker while the leader should do the
latter. So I asked about the means of signaling the leader to do
that. I may well be wrong, because I feel uneasy (or confused) about
this statement.

BTW, we also need to benchmark those changes to add the parent
pointers and change the return conventions and see if they have any
measurable impact on performance.

I have to bring you some bad news.

With the attached patch, an append over four foreign scans on one
server (local) performs about 10% faster, and about twice as fast for
three or four foreign scans on separate foreign servers (connections)
respectively, but only when compiled with -O0. I found that it takes
hopelessly little advantage of compiler optimization, while the
unpatched version gets faster.

Two things:

1. That's not the scenario I'm talking about. I'm concerned about
making sure that query plans that don't use asynchronous execution
don't get slower.

The first one (the select on t0) is that scenario. It has no relation
to the asynchronous stuff except the result_ready flag, a branch
caused by it, and the call to ExecDispatchNode. The only difference
from the original is that ExecProcNode uses ExecDispatchNode.
ExecAsyncWaitForNode is not even called.

2. I have to believe that's a defect in your implementation rather
than something intrinsic, or maybe your test scenario is bad. It's
very hard - really impossible - to believe that all queries involving
FDW pushdown are locally CPU-bound.

Sorry for the hard-to-read results, but the problem is not in a query
involving the FDW; it is in a query on a local table (which runs a
parallel seqscan). The reason for the difference in the tests
involving the FDW should be the local scans on the remote side.

Just reverting ExecProcNode and the other related parts doesn't
change the situation. I am proceeding with the confirmation,
reverting part by part.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#46Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#45)
Re: asynchronous and vectorized execution

At Mon, 11 Jul 2016 17:10:11 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160711.171011.133133724.horiguchi.kyotaro@lab.ntt.co.jp>

Two things:

1. That's not the scenario I'm talking about. I'm concerned about
making sure that query plans that don't use asynchronous execution
don't get slower.

The first one (the select on t0) is that scenario. It has no relation
to the asynchronous stuff except the result_ready flag, a branch
caused by it, and the call to ExecDispatchNode. The only difference
from the original is that ExecProcNode uses ExecDispatchNode.
ExecAsyncWaitForNode is not even called.

2. I have to believe that's a defect in your implementation rather
than something intrinsic, or maybe your test scenario is bad. It's
very hard - really impossible - to believe that all queries involving
FDW pushdown are locally CPU-bound.

Sorry for the hard-to-read results, but the problem is not in a query
involving the FDW; it is in a query on a local table (which runs a
parallel seqscan). The reason for the difference in the tests
involving the FDW should be the local scans on the remote side.

Just reverting ExecProcNode and the other related parts doesn't
change the situation. I am proceeding with the confirmation,
reverting part by part.

Uggh. I saw no difference even after finally falling back to plain
master. Stranger still, a binary built from what should be the same
"master" but extracted by "git archive | tar" finishes the query
(select sum(a) from t0) in half the time of the master in my git
repository with -O2. In short, the patch doesn't seem to be the
cause of the difference.

I should investigate the difference between them, or begin again
with a clean environment.

Anyway, I need some time to cool down.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#47Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#46)
Re: asynchronous and vectorized execution

Having cooled down, I measured performance again.

I will show just the corrected results briefly for now.

At Mon, 11 Jul 2016 19:07:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160711.190722.145849861.horiguchi.kyotaro@lab.ntt.co.jp>

Anyway, I need some time to cool down.

I recalled that I had put in a Makefile.custom containing
CFLAGS="-O0". Removing it gave me a saner result.

patched -O2

table   10-average(ms)   stddev   runtime diff from unpatched(%)
t0              441.78     0.32       3.4
pl              201.77     0.32      13.6
pf0            6619.22    18.99     -19.7
pf1            1800.72    32.72     -78.0
---
unpatched -O2

table   10-average(ms)   stddev
t0              427.21     0.42
pl              177.54     0.25
pf0            8250.42    23.29
pf1            8206.02    12.91

==========

3% slower for local 1*seqscan (2-parallel)
14% slower for append-4*seqscan (no-parallel)
19% faster for append-4*foreignscan (all scans on one connection)
78% faster for append-4*foreignscan (each scan has a dedicated connection)

ExecProcNode might be optimizable a bit further.
ExecAppend seems to need some fixes.

In addition to the above, I will try a reentrant
ExecAsyncWaitForNode or something similar.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#48Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#47)
Re: asynchronous and vectorized execution

I forgot to mention.

At Tue, 12 Jul 2016 11:04:17 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160712.110417.145469826.horiguchi.kyotaro@lab.ntt.co.jp>

Having cooled down, I measured performance again.

I will show just the corrected results briefly for now.

At Mon, 11 Jul 2016 19:07:22 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160711.190722.145849861.horiguchi.kyotaro@lab.ntt.co.jp>

Anyway, I need some time to cool down.

I recalled that I had put in a Makefile.custom containing
CFLAGS="-O0". Removing it gave me a saner result.

Unlike the previous measurements, the remote side here is an
unpatched -O2 postgres, so the differences come only from the
local-side changes.

patched -O2

table   10-average(ms)   stddev   runtime diff from unpatched(%)
t0              441.78     0.32       3.4
pl              201.77     0.32      13.6
pf0            6619.22    18.99     -19.7
pf1            1800.72    32.72     -78.0
---
unpatched -O2

table   10-average(ms)   stddev
t0              427.21     0.42
pl              177.54     0.25
pf0            8250.42    23.29
pf1            8206.02    12.91

==========

3% slower for local 1*seqscan (2-parallel)
14% slower for append-4*seqscan (no-parallel)
19% faster for append-4*foreignscan (all scans on one connection)
78% faster for append-4*foreignscan (each scan has a dedicated connection)

ExecProcNode might be optimizable a bit further.
ExecAppend seems to need some fixes.

In addition to the above, I will try a reentrant
ExecAsyncWaitForNode or something similar.

regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center


#49Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#48)
7 attachment(s)
Re: asynchronous and vectorized execution

Hello,

At Tue, 12 Jul 2016 11:42:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160712.114255.156540680.horiguchi.kyotaro@lab.ntt.co.jp>

3% slower for local 1*seqscan (2-parallel)
14% slower for append-4*seqscan (no-parallel)
19% faster for append-4*foreignscan (all scans on one connection)
78% faster for append-4*foreignscan (each scan has a dedicated connection)

ExecProcNode might be optimizable a bit further.
ExecAppend seems to need some fixes.

After some refactoring, the degradation for a simple seqscan is
reduced to 1.4%, and that of "Append(SeqScan())" to 1.7%. The gains
are the same as in the previous measurement. The scale has been
changed from the previous measurement in some test cases.

t0  (SeqScan()) (2 parallel)
pl  (Append(4 * SeqScan()))
pf0 (Append(4 * ForeignScan())) all ForeignScans on the same connection
pf1 (Append(4 * ForeignScan())) each ForeignScan has its own connection

patched-O2    time(ms)   stddev(ms)   gain from unpatched(%)
t0             4121.27         1.1      -1.44
pl             1757.41         0.94     -1.73
pf0            6458.99       192.4      20.26
pf1            1747.4         24.81     78.39

unpatched-O2  time(ms)   stddev(ms)
t0             4062.6          1.95
pl             1727.45         9.41
pf0            8100.47        24.51
pf1            8086.52        33.53

In addition to the above, I will try a reentrant
ExecAsyncWaitForNode or something similar.

After some consideration, I found that ExecAsyncWaitForNode cannot
be made reentrant, because that would mean control passes into
async-unaware nodes while not-ready nodes remain, which is an
inconsistent state. To inhibit such reentry, I allocate node
identifiers in depth-first order so that the ascendant-descendant
relationship can be checked in a simple way (the nested-set model),
and call ExecAsyncConfigureWait only for the descendant nodes of the
parameter planstate.
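
To illustrate the numbering (the names below are made up; the patch
attaches the numbers to the planstate tree itself): one depth-first
walk gives each node a (lo, hi) pair, and the ascendant-descendant
test then reduces to two integer comparisons:

#include <stdbool.h>

typedef struct PlanNode
{
	int			dfs_lo;			/* preorder number assigned on entry */
	int			dfs_hi;			/* number assigned after all children */
	int			nchildren;
	struct PlanNode **children;
} PlanNode;

/* Assign nested-set numbers in one depth-first pass. */
static int
assign_dfs_numbers(PlanNode *node, int counter)
{
	node->dfs_lo = counter++;
	for (int i = 0; i < node->nchildren; i++)
		counter = assign_dfs_numbers(node->children[i], counter);
	node->dfs_hi = counter++;
	return counter;
}

/* True iff "sub" is "root" itself or lies in the subtree under it. */
static bool
is_descendant(const PlanNode *root, const PlanNode *sub)
{
	return root->dfs_lo <= sub->dfs_lo && sub->dfs_hi <= root->dfs_hi;
}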

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Modify-PlanState-to-include-a-pointer-to-the-parent-.patch (text/x-patch)
From a8eb587236315ed5481c6b7e2d771197e3f4bf35 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 4 May 2016 12:19:03 -0400
Subject: [PATCH 1/7] Modify PlanState to include a pointer to the parent
 PlanState.

---
 src/backend/executor/execMain.c           | 22 ++++++++++++++--------
 src/backend/executor/execProcnode.c       |  5 ++++-
 src/backend/executor/nodeAgg.c            |  3 ++-
 src/backend/executor/nodeAppend.c         |  3 ++-
 src/backend/executor/nodeBitmapAnd.c      |  3 ++-
 src/backend/executor/nodeBitmapHeapscan.c |  3 ++-
 src/backend/executor/nodeBitmapOr.c       |  3 ++-
 src/backend/executor/nodeForeignscan.c    |  2 +-
 src/backend/executor/nodeGather.c         |  3 ++-
 src/backend/executor/nodeGroup.c          |  3 ++-
 src/backend/executor/nodeHash.c           |  3 ++-
 src/backend/executor/nodeHashjoin.c       |  6 ++++--
 src/backend/executor/nodeLimit.c          |  3 ++-
 src/backend/executor/nodeLockRows.c       |  3 ++-
 src/backend/executor/nodeMaterial.c       |  3 ++-
 src/backend/executor/nodeMergeAppend.c    |  3 ++-
 src/backend/executor/nodeMergejoin.c      |  4 +++-
 src/backend/executor/nodeModifyTable.c    |  3 ++-
 src/backend/executor/nodeNestloop.c       |  6 ++++--
 src/backend/executor/nodeRecursiveunion.c |  6 ++++--
 src/backend/executor/nodeResult.c         |  3 ++-
 src/backend/executor/nodeSetOp.c          |  3 ++-
 src/backend/executor/nodeSort.c           |  3 ++-
 src/backend/executor/nodeSubplan.c        |  1 +
 src/backend/executor/nodeSubqueryscan.c   |  3 ++-
 src/backend/executor/nodeUnique.c         |  3 ++-
 src/backend/executor/nodeWindowAgg.c      |  3 ++-
 src/include/executor/executor.h           |  3 ++-
 src/include/nodes/execnodes.h             |  2 ++
 29 files changed, 77 insertions(+), 37 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..ac6d62c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -923,7 +923,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
-	 * ExecInitSubPlan expects to be able to find these entries.
+	 * ExecInitSubPlan expects to be able to find these entries. Since the
+	 * main plan tree hasn't been initialized yet, we have to pass NULL as the
+	 * parent node to ExecInitNode; ExecInitSubPlan also takes responsibility
+	 * for fixing up subplanstate->parent.
 	 */
 	Assert(estate->es_subplanstates == NIL);
 	i = 1;						/* subplan indices count from 1 */
@@ -943,7 +946,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 		if (bms_is_member(i, plannedstmt->rewindPlanIDs))
 			sp_eflags |= EXEC_FLAG_REWIND;
 
-		subplanstate = ExecInitNode(subplan, estate, sp_eflags);
+		subplanstate = ExecInitNode(subplan, estate, NULL, sp_eflags);
 
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
@@ -954,9 +957,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/*
 	 * Initialize the private state information for all the nodes in the query
 	 * tree.  This opens files, allocates storage and leaves us ready to start
-	 * processing tuples.
+	 * processing tuples.  This is the root planstate node; it has no parent.
 	 */
-	planstate = ExecInitNode(plan, estate, eflags);
+	planstate = ExecInitNode(plan, estate, NULL, eflags);
 
 	/*
 	 * Get the tuple descriptor describing the type of tuples to return.
@@ -2849,7 +2852,9 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * ExecInitSubPlan expects to be able to find these entries. Some of the
 	 * SubPlans might not be used in the part of the plan tree we intend to
 	 * run, but since it's not easy to tell which, we just initialize them
-	 * all.
+	 * all.  Since the main plan tree hasn't been initialized yet, we have to
+	 * pass NULL as the parent node to ExecInitNode; ExecInitSubPlan also
+	 * takes responsibility for fixing up subplanstate->parent.
 	 */
 	Assert(estate->es_subplanstates == NIL);
 	foreach(l, parentestate->es_plannedstmt->subplans)
@@ -2857,7 +2862,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 		Plan	   *subplan = (Plan *) lfirst(l);
 		PlanState  *subplanstate;
 
-		subplanstate = ExecInitNode(subplan, estate, 0);
+		subplanstate = ExecInitNode(subplan, estate, NULL, 0);
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
 	}
@@ -2865,9 +2870,10 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	/*
 	 * Initialize the private state information for all the nodes in the part
 	 * of the plan tree we need to run.  This opens files, allocates storage
-	 * and leaves us ready to start processing tuples.
+	 * and leaves us ready to start processing tuples.  This is the root plan
+	 * node; it has no parent.
 	 */
-	epqstate->planstate = ExecInitNode(planTree, estate, 0);
+	epqstate->planstate = ExecInitNode(planTree, estate, NULL, 0);
 
 	MemoryContextSwitchTo(oldcontext);
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..680ca4b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -133,7 +133,7 @@
  * ------------------------------------------------------------------------
  */
 PlanState *
-ExecInitNode(Plan *node, EState *estate, int eflags)
+ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 {
 	PlanState  *result;
 	List	   *subps;
@@ -340,6 +340,9 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 			break;
 	}
 
+	/* Set parent pointer. */
+	result->parent = parent;
+
 	/*
 	 * Initialize any initPlans present in this node.  The planner put them in
 	 * a separate list for us.
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b3187e6..2c11acb 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2427,7 +2427,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	if (node->aggstrategy == AGG_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
 	outerPlan = outerPlan(node);
-	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(aggstate) =
+		ExecInitNode(outerPlan, estate, &aggstate->ss.ps, eflags);
 
 	/*
 	 * initialize source tuple type.
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..beb4ab8 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -165,7 +165,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		appendplanstates[i] = ExecInitNode(initNode, estate, &appendstate->ps,
+										   eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index c39d790..6405fa4 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -81,7 +81,8 @@ ExecInitBitmapAnd(BitmapAnd *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate,
+										   &bitmapandstate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..2ba5cd0 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -646,7 +646,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	 * relation's indexes, and we want to be sure we have acquired a lock on
 	 * the relation first.
 	 */
-	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate,
+											 &scanstate->ss.ps, eflags);
 
 	/*
 	 * all done.
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 7e928eb..faa3a37 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -82,7 +82,8 @@ ExecInitBitmapOr(BitmapOr *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate,
+										   &bitmaporstate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..7d9160d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -224,7 +224,7 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	/* Initialize any outer plan. */
 	if (outerPlan(node))
 		outerPlanState(scanstate) =
-			ExecInitNode(outerPlan(node), estate, eflags);
+			ExecInitNode(outerPlan(node), estate, &scanstate->ss.ps, eflags);
 
 	/*
 	 * Tell the FDW to initialize the scan.
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 313b234..6da52b3 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -97,7 +97,8 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	 * now initialize outer plan
 	 */
 	outerNode = outerPlan(node);
-	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
+	outerPlanState(gatherstate) =
+		ExecInitNode(outerNode, estate, &gatherstate->ps, eflags);
 
 	gatherstate->ps.ps_TupFromTlist = false;
 
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index dcf5175..3c066fc 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -233,7 +233,8 @@ ExecInitGroup(Group *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(grpstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(grpstate) =
+		ExecInitNode(outerPlan(node), estate, &grpstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 9ed09a7..5e78de0 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -200,7 +200,8 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(hashstate) =
+		ExecInitNode(outerPlan(node), estate, &hashstate->ps, eflags);
 
 	/*
 	 * initialize tuple type. no need to initialize projection info because
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 369e666..a7a908a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -486,8 +486,10 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	outerNode = outerPlan(node);
 	hashNode = (Hash *) innerPlan(node);
 
-	outerPlanState(hjstate) = ExecInitNode(outerNode, estate, eflags);
-	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
+	outerPlanState(hjstate) =
+		ExecInitNode(outerNode, estate, &hjstate->js.ps, eflags);
+	innerPlanState(hjstate) =
+		ExecInitNode((Plan *) hashNode, estate, &hjstate->js.ps, eflags);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index faf32e1..97267c5 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -412,7 +412,8 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	 * then initialize outer plan
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(limitstate) =
+		ExecInitNode(outerPlan, estate, &limitstate->ps, eflags);
 
 	/*
 	 * limit nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 4ebcaff..c4b5333 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -376,7 +376,8 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(lrstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(lrstate) =
+		ExecInitNode(outerPlan, estate, &lrstate->ps, eflags);
 
 	/*
 	 * LockRows nodes do no projections, so initialize projection info for
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9ab03f3..82e31c1 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -219,7 +219,8 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
 	outerPlan = outerPlan(node);
-	outerPlanState(matstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(matstate) =
+		ExecInitNode(outerPlan, estate, &matstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e271927..ae0e8dc 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -112,7 +112,8 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		mergeplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		mergeplanstates[i] =
+			ExecInitNode(initNode, estate, &mergestate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 6db09b8..cd8d6c6 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1522,8 +1522,10 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	 *
 	 * inner child must support MARK/RESTORE.
 	 */
-	outerPlanState(mergestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(mergestate) =
+		ExecInitNode(outerPlan(node), estate, &mergestate->js.ps, eflags);
 	innerPlanState(mergestate) = ExecInitNode(innerPlan(node), estate,
+											  &mergestate->js.ps,
 											  eflags | EXEC_FLAG_MARK);
 
 	/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index af7b26c..95cc2c6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1618,7 +1618,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
-		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
+		mtstate->mt_plans[i] =
+			ExecInitNode(subplan, estate, &mtstate->ps, eflags);
 
 		/* Also let FDWs init themselves for foreign-table result rels */
 		if (!resultRelInfo->ri_usesFdwDirectModify &&
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 555fa09..1895b60 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -340,12 +340,14 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 * inner child, because it will always be rescanned with fresh parameter
 	 * values.
 	 */
-	outerPlanState(nlstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(nlstate) =
+		ExecInitNode(outerPlan(node), estate, &nlstate->js.ps, eflags);
 	if (node->nestParams == NIL)
 		eflags |= EXEC_FLAG_REWIND;
 	else
 		eflags &= ~EXEC_FLAG_REWIND;
-	innerPlanState(nlstate) = ExecInitNode(innerPlan(node), estate, eflags);
+	innerPlanState(nlstate) =
+		ExecInitNode(innerPlan(node), estate, &nlstate->js.ps, eflags);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index e76405a..2328ef3 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -245,8 +245,10 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(rustate) = ExecInitNode(outerPlan(node), estate, eflags);
-	innerPlanState(rustate) = ExecInitNode(innerPlan(node), estate, eflags);
+	outerPlanState(rustate) =
+		ExecInitNode(outerPlan(node), estate, &rustate->ps, eflags);
+	innerPlanState(rustate) =
+		ExecInitNode(innerPlan(node), estate, &rustate->ps, eflags);
 
 	/*
 	 * If hashing, precompute fmgr lookup data for inner loop, and create the
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 4007b76..0d2de14 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -250,7 +250,8 @@ ExecInitResult(Result *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(resstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(resstate) =
+		ExecInitNode(outerPlan(node), estate, &resstate->ps, eflags);
 
 	/*
 	 * we don't use inner plan
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 2d81d46..7a3b67c 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -537,7 +537,8 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 	 */
 	if (node->strategy == SETOP_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
-	outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(setopstate) =
+		ExecInitNode(outerPlan(node), estate, &setopstate->ps, eflags);
 
 	/*
 	 * setop nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index a34dcc5..0286a7f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -199,7 +199,8 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	 */
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
-	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(sortstate) =
+		ExecInitNode(outerPlan(node), estate, &sortstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index e503494..458e254 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -707,6 +707,7 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent)
 
 	/* ... and to its parent's state */
 	sstate->parent = parent;
+	sstate->planstate->parent = parent;
 
 	/* Initialize subexpressions */
 	sstate->testexpr = ExecInitExpr((Expr *) subplan->testexpr, parent);
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 9bafc62..cb007a5 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -136,7 +136,8 @@ ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags)
 	/*
 	 * initialize subquery
 	 */
-	subquerystate->subplan = ExecInitNode(node->subplan, estate, eflags);
+	subquerystate->subplan =
+		ExecInitNode(node->subplan, estate, &subquerystate->ss.ps, eflags);
 
 	subquerystate->ss.ps.ps_TupFromTlist = false;
 
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 4caae34..5d13a89 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -145,7 +145,8 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(uniquestate) =
+		ExecInitNode(outerPlan(node), estate, &uniquestate->ps, eflags);
 
 	/*
 	 * unique nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index d4c88a1..bae713b 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1844,7 +1844,8 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	 * initialize child nodes
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(winstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(winstate) =
+		ExecInitNode(outerPlan, estate, &winstate->ss.ps, eflags);
 
 	/*
 	 * initialize source tuple type (which is also the tuple type that we'll
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 39521ed..28c0c2e 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -221,7 +221,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
 /*
  * prototypes from functions in execProcnode.c
  */
-extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
+extern PlanState *ExecInitNode(Plan *node, EState *estate, PlanState *parent,
+			 int eflags);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e7fd7bd..4b18436 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1030,6 +1030,8 @@ typedef struct PlanState
 								 * nodes point to one EState for the whole
 								 * top-level plan */
 
+	struct PlanState *parent;	/* node which will receive tuples from us */
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
-- 
1.8.3.1

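Before moving on to the next patch, a quick illustration of what the first patch buys us. This is a standalone sketch with mock structs, not real PostgreSQL code: the point is just that once every PlanState carries a parent pointer, any node can walk up to the root of the plan tree, which is the kind of traversal an asynchronous executor needs in order to hand control back up the tree when a node isn't ready to produce a tuple.

#include <stdio.h>

/* Stand-in for PlanState; only the new parent link matters here. */
typedef struct MockPlanState
{
	struct MockPlanState *parent;	/* node which will receive tuples from us */
	const char *label;				/* for demonstration only */
} MockPlanState;

/* Walk from any node to the plan root via the new parent links. */
static MockPlanState *
walk_to_root(MockPlanState *node)
{
	while (node->parent != NULL)
		node = node->parent;
	return node;
}

int
main(void)
{
	MockPlanState root = {NULL, "Limit"};
	MockPlanState join = {&root, "HashJoin"};
	MockPlanState scan = {&join, "SeqScan"};

	printf("root above %s is %s\n", scan.label, walk_to_root(&scan)->label);
	return 0;
}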
Attachment: 0002-Modify-PlanState-to-have-result-result_ready-fields..patch (text/x-patch; charset=us-ascii)
From 1bdda63123dcab5cb026a43c674effb711167476 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 6 May 2016 13:01:48 -0400
Subject: [PATCH 2/7] Modify PlanState to have result/result_ready fields.
 Modify executor to use them instead of returning tuples directly.

---
 src/backend/executor/execProcnode.c       | 75 ++++++++++++++++++-------------
 src/backend/executor/execScan.c           | 26 +++++++----
 src/backend/executor/nodeAgg.c            | 13 +++---
 src/backend/executor/nodeAppend.c         | 11 +++--
 src/backend/executor/nodeBitmapHeapscan.c |  2 +-
 src/backend/executor/nodeCtescan.c        |  2 +-
 src/backend/executor/nodeCustom.c         |  4 +-
 src/backend/executor/nodeForeignscan.c    |  2 +-
 src/backend/executor/nodeFunctionscan.c   |  2 +-
 src/backend/executor/nodeGather.c         | 17 ++++---
 src/backend/executor/nodeGroup.c          | 24 +++++++---
 src/backend/executor/nodeHash.c           |  3 +-
 src/backend/executor/nodeHashjoin.c       | 29 ++++++++----
 src/backend/executor/nodeIndexonlyscan.c  |  2 +-
 src/backend/executor/nodeIndexscan.c      |  2 +-
 src/backend/executor/nodeLimit.c          | 42 ++++++++++++-----
 src/backend/executor/nodeLockRows.c       |  9 ++--
 src/backend/executor/nodeMaterial.c       | 21 ++++++---
 src/backend/executor/nodeMergeAppend.c    |  4 +-
 src/backend/executor/nodeMergejoin.c      | 74 ++++++++++++++++++++++--------
 src/backend/executor/nodeModifyTable.c    | 15 ++++---
 src/backend/executor/nodeNestloop.c       | 16 ++++---
 src/backend/executor/nodeRecursiveunion.c | 10 +++--
 src/backend/executor/nodeResult.c         | 20 ++++++---
 src/backend/executor/nodeSamplescan.c     |  2 +-
 src/backend/executor/nodeSeqscan.c        |  2 +-
 src/backend/executor/nodeSetOp.c          | 14 +++---
 src/backend/executor/nodeSort.c           |  4 +-
 src/backend/executor/nodeSubqueryscan.c   |  2 +-
 src/backend/executor/nodeTidscan.c        |  2 +-
 src/backend/executor/nodeUnique.c         |  8 ++--
 src/backend/executor/nodeValuesscan.c     |  2 +-
 src/backend/executor/nodeWindowAgg.c      | 17 ++++---
 src/backend/executor/nodeWorktablescan.c  |  2 +-
 src/include/executor/executor.h           | 11 ++++-
 src/include/executor/nodeAgg.h            |  2 +-
 src/include/executor/nodeAppend.h         |  2 +-
 src/include/executor/nodeBitmapHeapscan.h |  2 +-
 src/include/executor/nodeCtescan.h        |  2 +-
 src/include/executor/nodeCustom.h         |  2 +-
 src/include/executor/nodeForeignscan.h    |  2 +-
 src/include/executor/nodeFunctionscan.h   |  2 +-
 src/include/executor/nodeGather.h         |  2 +-
 src/include/executor/nodeGroup.h          |  2 +-
 src/include/executor/nodeHash.h           |  2 +-
 src/include/executor/nodeHashjoin.h       |  2 +-
 src/include/executor/nodeIndexonlyscan.h  |  2 +-
 src/include/executor/nodeIndexscan.h      |  2 +-
 src/include/executor/nodeLimit.h          |  2 +-
 src/include/executor/nodeLockRows.h       |  2 +-
 src/include/executor/nodeMaterial.h       |  2 +-
 src/include/executor/nodeMergeAppend.h    |  2 +-
 src/include/executor/nodeMergejoin.h      |  2 +-
 src/include/executor/nodeModifyTable.h    |  2 +-
 src/include/executor/nodeNestloop.h       |  2 +-
 src/include/executor/nodeRecursiveunion.h |  2 +-
 src/include/executor/nodeResult.h         |  2 +-
 src/include/executor/nodeSamplescan.h     |  2 +-
 src/include/executor/nodeSeqscan.h        |  2 +-
 src/include/executor/nodeSetOp.h          |  2 +-
 src/include/executor/nodeSort.h           |  2 +-
 src/include/executor/nodeSubqueryscan.h   |  2 +-
 src/include/executor/nodeTidscan.h        |  2 +-
 src/include/executor/nodeUnique.h         |  2 +-
 src/include/executor/nodeValuesscan.h     |  2 +-
 src/include/executor/nodeWindowAgg.h      |  2 +-
 src/include/executor/nodeWorktablescan.h  |  2 +-
 src/include/nodes/execnodes.h             |  2 +
 68 files changed, 360 insertions(+), 197 deletions(-)

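Before the diff itself, a minimal standalone sketch of the new calling convention, using mock structs rather than the real PlanState and TupleTableSlot. ExecReturnTuple's definition doesn't appear in the hunks below, so the helper here is an assumption about its likely shape: stash the produced slot in node->result and set node->result_ready. Each node's Exec function now returns void and reports its result through those fields; ExecProcNode clears result_ready before dispatching and, since nothing is asynchronous yet, asserts that a result was produced before handing it back.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct MockSlot { int value; } MockSlot;

typedef struct MockPlanState
{
	MockSlot   *result;			/* tuple produced, or NULL at end of data */
	bool		result_ready;	/* has a result been produced yet? */
} MockPlanState;

/* Assumed shape of ExecReturnTuple: record the result and mark it ready. */
static void
MockExecReturnTuple(MockPlanState *node, MockSlot *slot)
{
	node->result = slot;
	node->result_ready = true;
}

/* A node's Exec routine in the new style: void return, result via fields. */
static void
MockExecNode(MockPlanState *node, MockSlot *next)
{
	MockExecReturnTuple(node, next);	/* next may be NULL: end of data */
}

/* Dispatcher in the style of the revised ExecProcNode. */
static MockSlot *
MockExecProcNode(MockPlanState *node, MockSlot *next)
{
	node->result_ready = false;		/* previous result has been consumed */
	MockExecNode(node, next);
	assert(node->result_ready);		/* no asynchronous execution yet */
	return node->result;
}

int
main(void)
{
	MockPlanState node = {NULL, false};
	MockSlot tuple = {42};

	printf("got %d\n", MockExecProcNode(&node, &tuple)->value);
	printf("end of data: %s\n",
		   MockExecProcNode(&node, NULL) == NULL ? "yes" : "no");
	return 0;
}

The reason for the detour through the fields instead of a return value is that, once asynchronous execution exists, a node that isn't ready can simply leave result_ready false rather than blocking, and the Assert above goes away.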
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 680ca4b..3f2ebff 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -380,6 +380,9 @@ ExecProcNode(PlanState *node)
 
 	CHECK_FOR_INTERRUPTS();
 
+	/* mark any previous result as having been consumed */
+	node->result_ready = false;
+
 	if (node->chgParam != NULL) /* something changed */
 		ExecReScan(node);		/* let ReScan handle this */
 
@@ -392,23 +395,23 @@ ExecProcNode(PlanState *node)
 			 * control nodes
 			 */
 		case T_ResultState:
-			result = ExecResult((ResultState *) node);
+			ExecResult((ResultState *) node);
 			break;
 
 		case T_ModifyTableState:
-			result = ExecModifyTable((ModifyTableState *) node);
+			ExecModifyTable((ModifyTableState *) node);
 			break;
 
 		case T_AppendState:
-			result = ExecAppend((AppendState *) node);
+			ExecAppend((AppendState *) node);
 			break;
 
 		case T_MergeAppendState:
-			result = ExecMergeAppend((MergeAppendState *) node);
+			ExecMergeAppend((MergeAppendState *) node);
 			break;
 
 		case T_RecursiveUnionState:
-			result = ExecRecursiveUnion((RecursiveUnionState *) node);
+			ExecRecursiveUnion((RecursiveUnionState *) node);
 			break;
 
 			/* BitmapAndState does not yield tuples */
@@ -419,119 +422,119 @@ ExecProcNode(PlanState *node)
 			 * scan nodes
 			 */
 		case T_SeqScanState:
-			result = ExecSeqScan((SeqScanState *) node);
+			ExecSeqScan((SeqScanState *) node);
 			break;
 
 		case T_SampleScanState:
-			result = ExecSampleScan((SampleScanState *) node);
+			ExecSampleScan((SampleScanState *) node);
 			break;
 
 		case T_IndexScanState:
-			result = ExecIndexScan((IndexScanState *) node);
+			ExecIndexScan((IndexScanState *) node);
 			break;
 
 		case T_IndexOnlyScanState:
-			result = ExecIndexOnlyScan((IndexOnlyScanState *) node);
+			ExecIndexOnlyScan((IndexOnlyScanState *) node);
 			break;
 
 			/* BitmapIndexScanState does not yield tuples */
 
 		case T_BitmapHeapScanState:
-			result = ExecBitmapHeapScan((BitmapHeapScanState *) node);
+			ExecBitmapHeapScan((BitmapHeapScanState *) node);
 			break;
 
 		case T_TidScanState:
-			result = ExecTidScan((TidScanState *) node);
+			ExecTidScan((TidScanState *) node);
 			break;
 
 		case T_SubqueryScanState:
-			result = ExecSubqueryScan((SubqueryScanState *) node);
+			ExecSubqueryScan((SubqueryScanState *) node);
 			break;
 
 		case T_FunctionScanState:
-			result = ExecFunctionScan((FunctionScanState *) node);
+			ExecFunctionScan((FunctionScanState *) node);
 			break;
 
 		case T_ValuesScanState:
-			result = ExecValuesScan((ValuesScanState *) node);
+			ExecValuesScan((ValuesScanState *) node);
 			break;
 
 		case T_CteScanState:
-			result = ExecCteScan((CteScanState *) node);
+			ExecCteScan((CteScanState *) node);
 			break;
 
 		case T_WorkTableScanState:
-			result = ExecWorkTableScan((WorkTableScanState *) node);
+			ExecWorkTableScan((WorkTableScanState *) node);
 			break;
 
 		case T_ForeignScanState:
-			result = ExecForeignScan((ForeignScanState *) node);
+			ExecForeignScan((ForeignScanState *) node);
 			break;
 
 		case T_CustomScanState:
-			result = ExecCustomScan((CustomScanState *) node);
+			ExecCustomScan((CustomScanState *) node);
 			break;
 
 			/*
 			 * join nodes
 			 */
 		case T_NestLoopState:
-			result = ExecNestLoop((NestLoopState *) node);
+			ExecNestLoop((NestLoopState *) node);
 			break;
 
 		case T_MergeJoinState:
-			result = ExecMergeJoin((MergeJoinState *) node);
+			ExecMergeJoin((MergeJoinState *) node);
 			break;
 
 		case T_HashJoinState:
-			result = ExecHashJoin((HashJoinState *) node);
+			ExecHashJoin((HashJoinState *) node);
 			break;
 
 			/*
 			 * materialization nodes
 			 */
 		case T_MaterialState:
-			result = ExecMaterial((MaterialState *) node);
+			ExecMaterial((MaterialState *) node);
 			break;
 
 		case T_SortState:
-			result = ExecSort((SortState *) node);
+			ExecSort((SortState *) node);
 			break;
 
 		case T_GroupState:
-			result = ExecGroup((GroupState *) node);
+			ExecGroup((GroupState *) node);
 			break;
 
 		case T_AggState:
-			result = ExecAgg((AggState *) node);
+			ExecAgg((AggState *) node);
 			break;
 
 		case T_WindowAggState:
-			result = ExecWindowAgg((WindowAggState *) node);
+			ExecWindowAgg((WindowAggState *) node);
 			break;
 
 		case T_UniqueState:
-			result = ExecUnique((UniqueState *) node);
+			ExecUnique((UniqueState *) node);
 			break;
 
 		case T_GatherState:
-			result = ExecGather((GatherState *) node);
+			ExecGather((GatherState *) node);
 			break;
 
 		case T_HashState:
-			result = ExecHash((HashState *) node);
+			ExecHash((HashState *) node);
 			break;
 
 		case T_SetOpState:
-			result = ExecSetOp((SetOpState *) node);
+			ExecSetOp((SetOpState *) node);
 			break;
 
 		case T_LockRowsState:
-			result = ExecLockRows((LockRowsState *) node);
+			ExecLockRows((LockRowsState *) node);
 			break;
 
 		case T_LimitState:
-			result = ExecLimit((LimitState *) node);
+			ExecLimit((LimitState *) node);
 			break;
 
 		default:
@@ -540,6 +543,14 @@ ExecProcNode(PlanState *node)
 			break;
 	}
 
+	/* We don't support asynchronous execution yet. */
+	Assert(node->result_ready);
+
+	/* Result should be a TupleTableSlot, unless it's NULL. */
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+
+	result = (TupleTableSlot *) node->result;
+
 	if (node->instrument)
 		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index fb0013d..095d40b 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -99,7 +99,7 @@ ExecScanFetch(ScanState *node,
  *		ExecScan
  *
  *		Scans the relation using the 'access method' indicated and
- *		returns the next qualifying tuple in the direction specified
+ *		produces the next qualifying tuple in the direction specified
  *		in the global variable ExecDirection.
  *		The access method returns the next tuple and ExecScan() is
  *		responsible for checking the tuple returned against the qual-clause.
@@ -117,7 +117,7 @@ ExecScanFetch(ScanState *node,
  *			 "cursor" is positioned before the first qualifying tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecScan(ScanState *node,
 		 ExecScanAccessMtd accessMtd,	/* function returning a tuple */
 		 ExecScanRecheckMtd recheckMtd)
@@ -137,12 +137,14 @@ ExecScan(ScanState *node,
 
 	/*
 	 * If we have neither a qual to check nor a projection to do, just skip
-	 * all the overhead and return the raw scan tuple.
+	 * all the overhead and produce the raw scan tuple.
 	 */
 	if (!qual && !projInfo)
 	{
 		ResetExprContext(econtext);
-		return ExecScanFetch(node, accessMtd, recheckMtd);
+		ExecReturnTuple(&node->ps,
+						ExecScanFetch(node, accessMtd, recheckMtd));
+		return;
 	}
 
 	/*
@@ -155,7 +157,10 @@ ExecScan(ScanState *node,
 		Assert(projInfo);		/* can't get here if not projecting */
 		resultSlot = ExecProject(projInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -188,9 +193,10 @@ ExecScan(ScanState *node,
 		if (TupIsNull(slot))
 		{
 			if (projInfo)
-				return ExecClearTuple(projInfo->pi_slot);
+				ExecReturnTuple(&node->ps, ExecClearTuple(projInfo->pi_slot));
 			else
-				return slot;
+				ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 
 		/*
@@ -221,7 +227,8 @@ ExecScan(ScanState *node,
 				if (isDone != ExprEndResult)
 				{
 					node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-					return resultSlot;
+					ExecReturnTuple(&node->ps, resultSlot);
+					return;
 				}
 			}
 			else
@@ -229,7 +236,8 @@ ExecScan(ScanState *node,
 				/*
 				 * Here, we aren't projecting, so just return scan tuple.
 				 */
-				return slot;
+				ExecReturnTuple(&node->ps, slot);
+				return;
 			}
 		}
 		else
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2c11acb..e690dbd 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1797,7 +1797,7 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
  *	  stored in the expression context to be used when ExecProject evaluates
  *	  the result tuple.
  */
-TupleTableSlot *
+void
 ExecAgg(AggState *node)
 {
 	TupleTableSlot *result;
@@ -1813,7 +1813,10 @@ ExecAgg(AggState *node)
 
 		result = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ss.ps.ps_TupFromTlist = false;
 	}
@@ -1823,6 +1826,7 @@ ExecAgg(AggState *node)
 	 * agg_done gets set before we emit the final aggregate tuple, and we have
 	 * to finish running SRFs for it.)
 	 */
+	result = NULL;
 	if (!node->agg_done)
 	{
 		/* Dispatch based on strategy */
@@ -1837,12 +1841,9 @@ ExecAgg(AggState *node)
 				result = agg_retrieve_direct(node);
 				break;
 		}
-
-		if (!TupIsNull(result))
-			return result;
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ss.ps, result);
 }
 
 /*
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index beb4ab8..e0ce8c6 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -191,7 +191,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
  *		Handles iteration over multiple subplans.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecAppend(AppendState *node)
 {
 	for (;;)
@@ -216,7 +216,8 @@ ExecAppend(AppendState *node)
 			 * NOT make use of the result slot that was set up in
 			 * ExecInitAppend; there's no need for it.
 			 */
-			return result;
+			ExecReturnTuple(&node->ps, result);
+			return;
 		}
 
 		/*
@@ -229,7 +230,11 @@ ExecAppend(AppendState *node)
 		else
 			node->as_whichplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			ExecReturnTuple(&node->ps,
+							ExecClearTuple(node->ps.ps_ResultTupleSlot));
+			return;
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2ba5cd0..31133ff 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -434,7 +434,7 @@ BitmapHeapRecheck(BitmapHeapScanState *node, TupleTableSlot *slot)
  *		ExecBitmapHeapScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecBitmapHeapScan(BitmapHeapScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 3c2f684..1f1fdf5 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -149,7 +149,7 @@ CteScanRecheck(CteScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecCteScan(CteScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeCustom.c b/src/backend/executor/nodeCustom.c
index 322abca..7162348 100644
--- a/src/backend/executor/nodeCustom.c
+++ b/src/backend/executor/nodeCustom.c
@@ -107,11 +107,11 @@ ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
 	return css;
 }
 
-TupleTableSlot *
+void
 ExecCustomScan(CustomScanState *node)
 {
 	Assert(node->methods->ExecCustomScan != NULL);
-	return node->methods->ExecCustomScan(node);
+	ExecReturnTuple(&node->ss.ps, node->methods->ExecCustomScan(node));
 }
 
 void
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 7d9160d..1f3e072 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -113,7 +113,7 @@ ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecForeignScan(ForeignScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index a03f6e7..3cccd8f 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -262,7 +262,7 @@ FunctionRecheck(FunctionScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecFunctionScan(FunctionScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 6da52b3..508ff75 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -126,7 +126,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
  *		the next qualifying tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecGather(GatherState *node)
 {
 	TupleTableSlot *fslot = node->funnel_slot;
@@ -207,7 +207,10 @@ ExecGather(GatherState *node)
 	{
 		resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -229,7 +232,10 @@ ExecGather(GatherState *node)
 		 */
 		slot = gather_getnext(node);
 		if (TupIsNull(slot))
-			return NULL;
+		{
+			ExecReturnTuple(&node->ps, NULL);
+			return;
+		}
 
 		/*
 		 * form the result tuple using ExecProject(), and return it --- unless
@@ -242,11 +248,12 @@ ExecGather(GatherState *node)
 		if (isDone != ExprEndResult)
 		{
 			node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-			return resultSlot;
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
 		}
 	}
 
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index 3c066fc..f33a316 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -31,7 +31,7 @@
  *
  *		Return one tuple for each group of matching input tuples.
  */
-TupleTableSlot *
+void
 ExecGroup(GroupState *node)
 {
 	ExprContext *econtext;
@@ -44,7 +44,10 @@ ExecGroup(GroupState *node)
 	 * get state info from node
 	 */
 	if (node->grp_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ss.ps, NULL);
+		return;
+	}
 	econtext = node->ss.ps.ps_ExprContext;
 	numCols = ((Group *) node->ss.ps.plan)->numCols;
 	grpColIdx = ((Group *) node->ss.ps.plan)->grpColIdx;
@@ -61,7 +64,10 @@ ExecGroup(GroupState *node)
 
 		result = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ss.ps.ps_TupFromTlist = false;
 	}
@@ -87,7 +93,8 @@ ExecGroup(GroupState *node)
 		{
 			/* empty input, so return nothing */
 			node->grp_done = TRUE;
-			return NULL;
+			ExecReturnTuple(&node->ss.ps, NULL);
+			return;
 		}
 		/* Copy tuple into firsttupleslot */
 		ExecCopySlot(firsttupleslot, outerslot);
@@ -115,7 +122,8 @@ ExecGroup(GroupState *node)
 			if (isDone != ExprEndResult)
 			{
 				node->ss.ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-				return result;
+				ExecReturnTuple(&node->ss.ps, result);
+				return;
 			}
 		}
 		else
@@ -139,7 +147,8 @@ ExecGroup(GroupState *node)
 			{
 				/* no more groups, so we're done */
 				node->grp_done = TRUE;
-				return NULL;
+				ExecReturnTuple(&node->ss.ps, NULL);
+				return;
 			}
 
 			/*
@@ -178,7 +187,8 @@ ExecGroup(GroupState *node)
 			if (isDone != ExprEndResult)
 			{
 				node->ss.ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-				return result;
+				ExecReturnTuple(&node->ss.ps, result);
+				return;
 			}
 		}
 		else
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 5e78de0..905eb30 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -56,11 +56,10 @@ static void *dense_alloc(HashJoinTable hashtable, Size size);
  *		stub for pro forma compliance
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecHash(HashState *node)
 {
 	elog(ERROR, "Hash node does not support ExecProcNode call convention");
-	return NULL;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index a7a908a..cc92fc3 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -58,7 +58,7 @@ static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
  *			  the other one is "outer".
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecHashJoin(HashJoinState *node)
 {
 	PlanState  *outerNode;
@@ -93,7 +93,10 @@ ExecHashJoin(HashJoinState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -155,7 +158,8 @@ ExecHashJoin(HashJoinState *node)
 					if (TupIsNull(node->hj_FirstOuterTupleSlot))
 					{
 						node->hj_OuterNotEmpty = false;
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 					}
 					else
 						node->hj_OuterNotEmpty = true;
@@ -183,7 +187,10 @@ ExecHashJoin(HashJoinState *node)
 				 * outer relation.
 				 */
 				if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
-					return NULL;
+				{
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
+				}
 
 				/*
 				 * need to remember whether nbatch has increased since we
@@ -323,7 +330,8 @@ ExecHashJoin(HashJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -362,7 +370,8 @@ ExecHashJoin(HashJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -401,7 +410,8 @@ ExecHashJoin(HashJoinState *node)
 					{
 						node->js.ps.ps_TupFromTlist =
 							(isDone == ExprMultipleResult);
-						return result;
+						ExecReturnTuple(&node->js.ps, result);
+						return;
 					}
 				}
 				else
@@ -414,7 +424,10 @@ ExecHashJoin(HashJoinState *node)
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
 				if (!ExecHashJoinNewBatch(node))
-					return NULL;	/* end of join */
+				{
+					ExecReturnTuple(&node->js.ps, NULL); /* end of join */
+					return;
+				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..47285a1 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -249,7 +249,7 @@ IndexOnlyRecheck(IndexOnlyScanState *node, TupleTableSlot *slot)
  *		ExecIndexOnlyScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecIndexOnlyScan(IndexOnlyScanState *node)
 {
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..6bf35d3 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -482,7 +482,7 @@ reorderqueue_pop(IndexScanState *node)
  *		ExecIndexScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecIndexScan(IndexScanState *node)
 {
 	/*
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index 97267c5..4e70183 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -36,7 +36,7 @@ static void pass_down_bound(LimitState *node, PlanState *child_node);
  *		filtering on the stream of tuples returned by a subplan.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecLimit(LimitState *node)
 {
 	ScanDirection direction;
@@ -72,7 +72,10 @@ ExecLimit(LimitState *node)
 			 * If backwards scan, just return NULL without changing state.
 			 */
 			if (!ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Check for empty window; if so, treat like empty subplan.
@@ -80,7 +83,8 @@ ExecLimit(LimitState *node)
 			if (node->count <= 0 && !node->noCount)
 			{
 				node->lstate = LIMIT_EMPTY;
-				return NULL;
+				ExecReturnTuple(&node->ps, NULL);
+				return;
 			}
 
 			/*
@@ -96,7 +100,8 @@ ExecLimit(LimitState *node)
 					 * any output at all.
 					 */
 					node->lstate = LIMIT_EMPTY;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 				node->subSlot = slot;
 				if (++node->position > node->offset)
@@ -115,7 +120,8 @@ ExecLimit(LimitState *node)
 			 * The subplan is known to return no tuples (or not more than
 			 * OFFSET tuples, in general).  So we return no tuples.
 			 */
-			return NULL;
+			ExecReturnTuple(&node->ps, NULL);
+			return;
 
 		case LIMIT_INWINDOW:
 			if (ScanDirectionIsForward(direction))
@@ -130,7 +136,8 @@ ExecLimit(LimitState *node)
 					node->position - node->offset >= node->count)
 				{
 					node->lstate = LIMIT_WINDOWEND;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 
 				/*
@@ -140,7 +147,8 @@ ExecLimit(LimitState *node)
 				if (TupIsNull(slot))
 				{
 					node->lstate = LIMIT_SUBPLANEOF;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 				node->subSlot = slot;
 				node->position++;
@@ -154,7 +162,8 @@ ExecLimit(LimitState *node)
 				if (node->position <= node->offset + 1)
 				{
 					node->lstate = LIMIT_WINDOWSTART;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 
 				/*
@@ -170,7 +179,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_SUBPLANEOF:
 			if (ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Backing up from subplan EOF, so re-fetch previous tuple; there
@@ -186,7 +198,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_WINDOWEND:
 			if (ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Backing up from window end: simply re-return the last tuple
@@ -199,7 +214,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_WINDOWSTART:
 			if (!ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Advancing after having backed off window start: simply
@@ -220,7 +238,7 @@ ExecLimit(LimitState *node)
 	/* Return the current tuple */
 	Assert(!TupIsNull(slot));
 
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /*
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index c4b5333..8daa203 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -35,7 +35,7 @@
  *		ExecLockRows
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecLockRows(LockRowsState *node)
 {
 	TupleTableSlot *slot;
@@ -57,7 +57,10 @@ lnext:
 	slot = ExecProcNode(outerPlan);
 
 	if (TupIsNull(slot))
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/* We don't need EvalPlanQual unless we get updated tuple version(s) */
 	epq_needed = false;
@@ -334,7 +337,7 @@ lnext:
 	}
 
 	/* Got all locks, so return the current tuple */
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 82e31c1..fd3b013 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -35,7 +35,7 @@
  *
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* result tuple from subplan */
+void
 ExecMaterial(MaterialState *node)
 {
 	EState	   *estate;
@@ -93,7 +93,11 @@ ExecMaterial(MaterialState *node)
 			 * fetch.
 			 */
 			if (!tuplestore_advance(tuplestorestate, forward))
-				return NULL;	/* the tuplestore must be empty */
+			{
+				/* the tuplestore must be empty */
+				ExecReturnTuple(&node->ss.ps, NULL);
+				return;
+			}
 		}
 		eof_tuplestore = false;
 	}
@@ -105,7 +109,10 @@ ExecMaterial(MaterialState *node)
 	if (!eof_tuplestore)
 	{
 		if (tuplestore_gettupleslot(tuplestorestate, forward, false, slot))
-			return slot;
+		{
+			ExecReturnTuple(&node->ss.ps, slot);
+			return;
+		}
 		if (forward)
 			eof_tuplestore = true;
 	}
@@ -132,7 +139,8 @@ ExecMaterial(MaterialState *node)
 		if (TupIsNull(outerslot))
 		{
 			node->eof_underlying = true;
-			return NULL;
+			ExecReturnTuple(&node->ss.ps, NULL);
+			return;
 		}
 
 		/*
@@ -146,13 +154,14 @@ ExecMaterial(MaterialState *node)
 		/*
 		 * We can just return the subplan's returned tuple, without copying.
 		 */
-		return outerslot;
+		ExecReturnTuple(&node->ss.ps, outerslot);
+		return;
 	}
 
 	/*
 	 * Nothing left ...
 	 */
-	return ExecClearTuple(slot);
+	ExecReturnTuple(&node->ss.ps, ExecClearTuple(slot));
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index ae0e8dc..3ef8120 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -164,7 +164,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
  *		Handles iteration over multiple subplans.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecMergeAppend(MergeAppendState *node)
 {
 	TupleTableSlot *result;
@@ -214,7 +214,7 @@ ExecMergeAppend(MergeAppendState *node)
 		result = node->ms_slots[i];
 	}
 
-	return result;
+	ExecReturnTuple(&node->ps, result);
 }
 
 /*
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index cd8d6c6..d73d9f4 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -615,7 +615,7 @@ ExecMergeTupleDump(MergeJoinState *mergestate)
  *		ExecMergeJoin
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecMergeJoin(MergeJoinState *node)
 {
 	List	   *joinqual;
@@ -653,7 +653,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -710,7 +713,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 							result = MJFillOuter(node);
 							if (result)
-								return result;
+							{
+								ExecReturnTuple(&node->js.ps, result);
+								return;
+							}
 						}
 						break;
 					case MJEVAL_ENDOFJOIN:
@@ -728,7 +734,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -765,7 +772,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 							result = MJFillInner(node);
 							if (result)
-								return result;
+							{
+								ExecReturnTuple(&node->js.ps, result);
+								return;
+							}
 						}
 						break;
 					case MJEVAL_ENDOFJOIN:
@@ -785,7 +795,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -868,7 +879,8 @@ ExecMergeJoin(MergeJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -901,7 +913,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1003,7 +1018,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1039,7 +1057,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1174,7 +1193,8 @@ ExecMergeJoin(MergeJoinState *node)
 								break;
 							}
 							/* Otherwise we're done. */
-							return NULL;
+							ExecReturnTuple(&node->js.ps, NULL);
+							return;
 					}
 				}
 				break;
@@ -1256,7 +1276,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1292,7 +1315,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1318,7 +1342,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/* Mark before advancing, if wanted */
@@ -1362,7 +1389,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1388,7 +1416,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/* Mark before advancing, if wanted */
@@ -1406,7 +1437,8 @@ ExecMergeJoin(MergeJoinState *node)
 				if (TupIsNull(innerTupleSlot))
 				{
 					MJ_printf("ExecMergeJoin: end of inner subplan\n");
-					return NULL;
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
 				}
 
 				/* Else remain in ENDOUTER state and process next tuple. */
@@ -1434,7 +1466,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1448,7 +1483,8 @@ ExecMergeJoin(MergeJoinState *node)
 				if (TupIsNull(outerTupleSlot))
 				{
 					MJ_printf("ExecMergeJoin: end of outer subplan\n");
-					return NULL;
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
 				}
 
 				/* Else remain in ENDINNER state and process next tuple. */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95cc2c6..0e05d4d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1298,7 +1298,7 @@ fireASTriggers(ModifyTableState *node)
  *		if needed.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecModifyTable(ModifyTableState *node)
 {
 	EState	   *estate = node->ps.state;
@@ -1333,7 +1333,10 @@ ExecModifyTable(ModifyTableState *node)
 	 * extra times.
 	 */
 	if (node->mt_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/*
 	 * On first call, fire BEFORE STATEMENT triggers before proceeding.
@@ -1411,7 +1414,8 @@ ExecModifyTable(ModifyTableState *node)
 			slot = ExecProcessReturning(resultRelInfo, NULL, planSlot);
 
 			estate->es_result_relation_info = saved_resultRelInfo;
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 
 		EvalPlanQualSetSlot(&node->mt_epqstate, planSlot);
@@ -1517,7 +1521,8 @@ ExecModifyTable(ModifyTableState *node)
 		if (slot)
 		{
 			estate->es_result_relation_info = saved_resultRelInfo;
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 	}
 
@@ -1531,7 +1536,7 @@ ExecModifyTable(ModifyTableState *node)
 
 	node->mt_done = true;
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 1895b60..54eff56 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -56,7 +56,7 @@
  *			   are prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecNestLoop(NestLoopState *node)
 {
 	NestLoop   *nl;
@@ -93,7 +93,10 @@ ExecNestLoop(NestLoopState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -128,7 +131,8 @@ ExecNestLoop(NestLoopState *node)
 			if (TupIsNull(outerTupleSlot))
 			{
 				ENL1_printf("no outer tuple, ending join");
-				return NULL;
+				ExecReturnTuple(&node->js.ps, NULL);
+				return;
 			}
 
 			ENL1_printf("saving new outer tuple information");
@@ -212,7 +216,8 @@ ExecNestLoop(NestLoopState *node)
 					{
 						node->js.ps.ps_TupFromTlist =
 							(isDone == ExprMultipleResult);
-						return result;
+						ExecReturnTuple(&node->js.ps, result);
+						return;
 					}
 				}
 				else
@@ -270,7 +275,8 @@ ExecNestLoop(NestLoopState *node)
 				{
 					node->js.ps.ps_TupFromTlist =
 						(isDone == ExprMultipleResult);
-					return result;
+					ExecReturnTuple(&node->js.ps, result);
+					return;
 				}
 			}
 			else
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 2328ef3..6e78eb2 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -72,7 +72,7 @@ build_hash_table(RecursiveUnionState *rustate)
  * 2.6 go back to 2.2
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecRecursiveUnion(RecursiveUnionState *node)
 {
 	PlanState  *outerPlan = outerPlanState(node);
@@ -102,7 +102,8 @@ ExecRecursiveUnion(RecursiveUnionState *node)
 			/* Each non-duplicate tuple goes to the working table ... */
 			tuplestore_puttupleslot(node->working_table, slot);
 			/* ... and to the caller */
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 		node->recursing = true;
 	}
@@ -151,10 +152,11 @@ ExecRecursiveUnion(RecursiveUnionState *node)
 		node->intermediate_empty = false;
 		tuplestore_puttupleslot(node->intermediate_table, slot);
 		/* ... and return it */
-		return slot;
+		ExecReturnTuple(&node->ps, slot);
+		return;
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 0d2de14..a830ffd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -63,7 +63,7 @@
  *		'nil' if the constant qualification is not satisfied.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecResult(ResultState *node)
 {
 	TupleTableSlot *outerTupleSlot;
@@ -87,7 +87,8 @@ ExecResult(ResultState *node)
 		if (!qualResult)
 		{
 			node->rs_done = true;
-			return NULL;
+			ExecReturnTuple(&node->ps, NULL);
+			return;
 		}
 	}
 
@@ -100,7 +101,10 @@ ExecResult(ResultState *node)
 	{
 		resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -130,7 +134,10 @@ ExecResult(ResultState *node)
 			outerTupleSlot = ExecProcNode(outerPlan);
 
 			if (TupIsNull(outerTupleSlot))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * prepare to compute projection expressions, which will expect to
@@ -157,11 +164,12 @@ ExecResult(ResultState *node)
 		if (isDone != ExprEndResult)
 		{
 			node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-			return resultSlot;
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
 		}
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 9ce7c02..89cce0e 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -95,7 +95,7 @@ SampleRecheck(SampleScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSampleScan(SampleScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 00bf3a5..0ca86d9 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -121,7 +121,7 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSeqScan(SeqScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 7a3b67c..b7a593f 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -191,7 +191,7 @@ set_output_count(SetOpState *setopstate, SetOpStatePerGroup pergroup)
  *		ExecSetOp
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecSetOp(SetOpState *node)
 {
 	SetOp	   *plannode = (SetOp *) node->ps.plan;
@@ -204,22 +204,26 @@ ExecSetOp(SetOpState *node)
 	if (node->numOutput > 0)
 	{
 		node->numOutput--;
-		return resultTupleSlot;
+		ExecReturnTuple(&node->ps, resultTupleSlot);
+		return;
 	}
 
 	/* Otherwise, we're done if we are out of groups */
 	if (node->setop_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/* Fetch the next tuple group according to the correct strategy */
 	if (plannode->strategy == SETOP_HASHED)
 	{
 		if (!node->table_filled)
 			setop_fill_hash_table(node);
-		return setop_retrieve_hash_table(node);
+		ExecReturnTuple(&node->ps, setop_retrieve_hash_table(node));
 	}
 	else
-		return setop_retrieve_direct(node);
+		ExecReturnTuple(&node->ps, setop_retrieve_direct(node));
 }
 
 /*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 0286a7f..13f721a 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -35,7 +35,7 @@
  *		  -- the outer child is prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSort(SortState *node)
 {
 	EState	   *estate;
@@ -138,7 +138,7 @@ ExecSort(SortState *node)
 	(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
 								  slot, NULL);
-	return slot;
+	ExecReturnTuple(&node->ss.ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index cb007a5..0562926 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -79,7 +79,7 @@ SubqueryRecheck(SubqueryScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSubqueryScan(SubqueryScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 2604103..e2a0479 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -387,7 +387,7 @@ TidRecheck(TidScanState *node, TupleTableSlot *slot)
  *		  -- tidPtr is -1.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecTidScan(TidScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 5d13a89..2daa001 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -42,7 +42,7 @@
  *		ExecUnique
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecUnique(UniqueState *node)
 {
 	Unique	   *plannode = (Unique *) node->ps.plan;
@@ -70,8 +70,8 @@ ExecUnique(UniqueState *node)
 		if (TupIsNull(slot))
 		{
 			/* end of subplan, so we're done */
-			ExecClearTuple(resultTupleSlot);
-			return NULL;
+			ExecReturnTuple(&node->ps, ExecClearTuple(resultTupleSlot));
+			return;
 		}
 
 		/*
@@ -98,7 +98,7 @@ ExecUnique(UniqueState *node)
 	 * won't guarantee that this source tuple is still accessible after
 	 * fetching the next source tuple.
 	 */
-	return ExecCopySlot(resultTupleSlot, slot);
+	ExecReturnTuple(&node->ps, ExecCopySlot(resultTupleSlot, slot));
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeValuesscan.c b/src/backend/executor/nodeValuesscan.c
index 9c03f8a..3e6c321 100644
--- a/src/backend/executor/nodeValuesscan.c
+++ b/src/backend/executor/nodeValuesscan.c
@@ -186,7 +186,7 @@ ValuesRecheck(ValuesScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecValuesScan(ValuesScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index bae713b..62fe48b 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1555,7 +1555,7 @@ update_frametailpos(WindowObject winobj, TupleTableSlot *slot)
  *	(ignoring the case of SRFs in the targetlist, that is).
  * -----------------
  */
-TupleTableSlot *
+void
 ExecWindowAgg(WindowAggState *winstate)
 {
 	TupleTableSlot *result;
@@ -1565,7 +1565,10 @@ ExecWindowAgg(WindowAggState *winstate)
 	int			numfuncs;
 
 	if (winstate->all_done)
-		return NULL;
+	{
+		ExecReturnTuple(&winstate->ss.ps, NULL);
+		return;
+	}
 
 	/*
 	 * Check to see if we're still projecting out tuples from a previous
@@ -1579,7 +1582,10 @@ ExecWindowAgg(WindowAggState *winstate)
 
 		result = ExecProject(winstate->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&winstate->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		winstate->ss.ps.ps_TupFromTlist = false;
 	}
@@ -1687,7 +1693,8 @@ restart:
 		else
 		{
 			winstate->all_done = true;
-			return NULL;
+			ExecReturnTuple(&winstate->ss.ps, NULL);
+			return;
 		}
 	}
 
@@ -1753,7 +1760,7 @@ restart:
 
 	winstate->ss.ps.ps_TupFromTlist =
 		(isDone == ExprMultipleResult);
-	return result;
+	ExecReturnTuple(&winstate->ss.ps, result);
 }
 
 /* -----------------
diff --git a/src/backend/executor/nodeWorktablescan.c b/src/backend/executor/nodeWorktablescan.c
index cfed6e6..c3615b2 100644
--- a/src/backend/executor/nodeWorktablescan.c
+++ b/src/backend/executor/nodeWorktablescan.c
@@ -77,7 +77,7 @@ WorkTableScanRecheck(WorkTableScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecWorkTableScan(WorkTableScanState *node)
 {
 	/*
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 28c0c2e..1eb09d8 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -228,6 +228,15 @@ extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
+/* Convenience function to set a node's result to a TupleTableSlot. */
+static inline void
+ExecReturnTuple(PlanState *node, TupleTableSlot *slot)
+{
+	Assert(!node->result_ready);
+	node->result = (Node *) slot;
+	node->result_ready = true;
+}
+
 /*
  * prototypes from functions in execQual.c
  */
@@ -256,7 +265,7 @@ extern TupleTableSlot *ExecProject(ProjectionInfo *projInfo,
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
 
-extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
+extern void ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 		 ExecScanRecheckMtd recheckMtd);
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, Index varno);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 54c75e8..b86ec6a 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAgg(AggState *node);
+extern void ExecAgg(AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..70a6b62 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern AppendState *ExecInitAppend(Append *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAppend(AppendState *node);
+extern void ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index 0ed9c78..069dbc7 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern BitmapHeapScanState *ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecBitmapHeapScan(BitmapHeapScanState *node);
+extern void ExecBitmapHeapScan(BitmapHeapScanState *node);
 extern void ExecEndBitmapHeapScan(BitmapHeapScanState *node);
 extern void ExecReScanBitmapHeapScan(BitmapHeapScanState *node);
 
diff --git a/src/include/executor/nodeCtescan.h b/src/include/executor/nodeCtescan.h
index ef5c2bc..8411fa1 100644
--- a/src/include/executor/nodeCtescan.h
+++ b/src/include/executor/nodeCtescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern CteScanState *ExecInitCteScan(CteScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecCteScan(CteScanState *node);
+extern void ExecCteScan(CteScanState *node);
 extern void ExecEndCteScan(CteScanState *node);
 extern void ExecReScanCteScan(CteScanState *node);
 
diff --git a/src/include/executor/nodeCustom.h b/src/include/executor/nodeCustom.h
index 7d16c2b..5df2ebb 100644
--- a/src/include/executor/nodeCustom.h
+++ b/src/include/executor/nodeCustom.h
@@ -21,7 +21,7 @@
  */
 extern CustomScanState *ExecInitCustomScan(CustomScan *custom_scan,
 				   EState *estate, int eflags);
-extern TupleTableSlot *ExecCustomScan(CustomScanState *node);
+extern void ExecCustomScan(CustomScanState *node);
 extern void ExecEndCustomScan(CustomScanState *node);
 
 extern void ExecReScanCustomScan(CustomScanState *node);
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3d0f7bd 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -18,7 +18,7 @@
 #include "nodes/execnodes.h"
 
 extern ForeignScanState *ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecForeignScan(ForeignScanState *node);
+extern void ExecForeignScan(ForeignScanState *node);
 extern void ExecEndForeignScan(ForeignScanState *node);
 extern void ExecReScanForeignScan(ForeignScanState *node);
 
diff --git a/src/include/executor/nodeFunctionscan.h b/src/include/executor/nodeFunctionscan.h
index d6e7a61..15beb13 100644
--- a/src/include/executor/nodeFunctionscan.h
+++ b/src/include/executor/nodeFunctionscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern FunctionScanState *ExecInitFunctionScan(FunctionScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecFunctionScan(FunctionScanState *node);
+extern void ExecFunctionScan(FunctionScanState *node);
 extern void ExecEndFunctionScan(FunctionScanState *node);
 extern void ExecReScanFunctionScan(FunctionScanState *node);
 
diff --git a/src/include/executor/nodeGather.h b/src/include/executor/nodeGather.h
index f76d9be..100a827 100644
--- a/src/include/executor/nodeGather.h
+++ b/src/include/executor/nodeGather.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern GatherState *ExecInitGather(Gather *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecGather(GatherState *node);
+extern void ExecGather(GatherState *node);
 extern void ExecEndGather(GatherState *node);
 extern void ExecShutdownGather(GatherState *node);
 extern void ExecReScanGather(GatherState *node);
diff --git a/src/include/executor/nodeGroup.h b/src/include/executor/nodeGroup.h
index 92639f5..446ded5 100644
--- a/src/include/executor/nodeGroup.h
+++ b/src/include/executor/nodeGroup.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern GroupState *ExecInitGroup(Group *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecGroup(GroupState *node);
+extern void ExecGroup(GroupState *node);
 extern void ExecEndGroup(GroupState *node);
 extern void ExecReScanGroup(GroupState *node);
 
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 8cf6d15..b395fd9 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern HashState *ExecInitHash(Hash *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHash(HashState *node);
+extern void ExecHash(HashState *node);
 extern Node *MultiExecHash(HashState *node);
 extern void ExecEndHash(HashState *node);
 extern void ExecReScanHash(HashState *node);
diff --git a/src/include/executor/nodeHashjoin.h b/src/include/executor/nodeHashjoin.h
index f24127a..072c610 100644
--- a/src/include/executor/nodeHashjoin.h
+++ b/src/include/executor/nodeHashjoin.h
@@ -18,7 +18,7 @@
 #include "storage/buffile.h"
 
 extern HashJoinState *ExecInitHashJoin(HashJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHashJoin(HashJoinState *node);
+extern void ExecHashJoin(HashJoinState *node);
 extern void ExecEndHashJoin(HashJoinState *node);
 extern void ExecReScanHashJoin(HashJoinState *node);
 
diff --git a/src/include/executor/nodeIndexonlyscan.h b/src/include/executor/nodeIndexonlyscan.h
index d63d194..0fbcf80 100644
--- a/src/include/executor/nodeIndexonlyscan.h
+++ b/src/include/executor/nodeIndexonlyscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern IndexOnlyScanState *ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecIndexOnlyScan(IndexOnlyScanState *node);
+extern void ExecIndexOnlyScan(IndexOnlyScanState *node);
 extern void ExecEndIndexOnlyScan(IndexOnlyScanState *node);
 extern void ExecIndexOnlyMarkPos(IndexOnlyScanState *node);
 extern void ExecIndexOnlyRestrPos(IndexOnlyScanState *node);
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..341dab3 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecIndexScan(IndexScanState *node);
+extern void ExecIndexScan(IndexScanState *node);
 extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
diff --git a/src/include/executor/nodeLimit.h b/src/include/executor/nodeLimit.h
index 96166b4..03dde30 100644
--- a/src/include/executor/nodeLimit.h
+++ b/src/include/executor/nodeLimit.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern LimitState *ExecInitLimit(Limit *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLimit(LimitState *node);
+extern void ExecLimit(LimitState *node);
 extern void ExecEndLimit(LimitState *node);
 extern void ExecReScanLimit(LimitState *node);
 
diff --git a/src/include/executor/nodeLockRows.h b/src/include/executor/nodeLockRows.h
index e828e9c..eda3cbec 100644
--- a/src/include/executor/nodeLockRows.h
+++ b/src/include/executor/nodeLockRows.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern LockRowsState *ExecInitLockRows(LockRows *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLockRows(LockRowsState *node);
+extern void ExecLockRows(LockRowsState *node);
 extern void ExecEndLockRows(LockRowsState *node);
 extern void ExecReScanLockRows(LockRowsState *node);
 
diff --git a/src/include/executor/nodeMaterial.h b/src/include/executor/nodeMaterial.h
index 2b8cae1..20bc7f6 100644
--- a/src/include/executor/nodeMaterial.h
+++ b/src/include/executor/nodeMaterial.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MaterialState *ExecInitMaterial(Material *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMaterial(MaterialState *node);
+extern void ExecMaterial(MaterialState *node);
 extern void ExecEndMaterial(MaterialState *node);
 extern void ExecMaterialMarkPos(MaterialState *node);
 extern void ExecMaterialRestrPos(MaterialState *node);
diff --git a/src/include/executor/nodeMergeAppend.h b/src/include/executor/nodeMergeAppend.h
index 0efc489..e43b5e6 100644
--- a/src/include/executor/nodeMergeAppend.h
+++ b/src/include/executor/nodeMergeAppend.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MergeAppendState *ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMergeAppend(MergeAppendState *node);
+extern void ExecMergeAppend(MergeAppendState *node);
 extern void ExecEndMergeAppend(MergeAppendState *node);
 extern void ExecReScanMergeAppend(MergeAppendState *node);
 
diff --git a/src/include/executor/nodeMergejoin.h b/src/include/executor/nodeMergejoin.h
index 74d691c..dfdbc1b 100644
--- a/src/include/executor/nodeMergejoin.h
+++ b/src/include/executor/nodeMergejoin.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MergeJoinState *ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMergeJoin(MergeJoinState *node);
+extern void ExecMergeJoin(MergeJoinState *node);
 extern void ExecEndMergeJoin(MergeJoinState *node);
 extern void ExecReScanMergeJoin(MergeJoinState *node);
 
diff --git a/src/include/executor/nodeModifyTable.h b/src/include/executor/nodeModifyTable.h
index 6b66353..fe67248 100644
--- a/src/include/executor/nodeModifyTable.h
+++ b/src/include/executor/nodeModifyTable.h
@@ -16,7 +16,7 @@
 #include "nodes/execnodes.h"
 
 extern ModifyTableState *ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecModifyTable(ModifyTableState *node);
+extern void ExecModifyTable(ModifyTableState *node);
 extern void ExecEndModifyTable(ModifyTableState *node);
 extern void ExecReScanModifyTable(ModifyTableState *node);
 
diff --git a/src/include/executor/nodeNestloop.h b/src/include/executor/nodeNestloop.h
index eeb42d6..cab1885 100644
--- a/src/include/executor/nodeNestloop.h
+++ b/src/include/executor/nodeNestloop.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern NestLoopState *ExecInitNestLoop(NestLoop *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecNestLoop(NestLoopState *node);
+extern void ExecNestLoop(NestLoopState *node);
 extern void ExecEndNestLoop(NestLoopState *node);
 extern void ExecReScanNestLoop(NestLoopState *node);
 
diff --git a/src/include/executor/nodeRecursiveunion.h b/src/include/executor/nodeRecursiveunion.h
index 1c08790..fb11eca 100644
--- a/src/include/executor/nodeRecursiveunion.h
+++ b/src/include/executor/nodeRecursiveunion.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern RecursiveUnionState *ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecRecursiveUnion(RecursiveUnionState *node);
+extern void ExecRecursiveUnion(RecursiveUnionState *node);
 extern void ExecEndRecursiveUnion(RecursiveUnionState *node);
 extern void ExecReScanRecursiveUnion(RecursiveUnionState *node);
 
diff --git a/src/include/executor/nodeResult.h b/src/include/executor/nodeResult.h
index 356027f..951fae6 100644
--- a/src/include/executor/nodeResult.h
+++ b/src/include/executor/nodeResult.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern ResultState *ExecInitResult(Result *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecResult(ResultState *node);
+extern void ExecResult(ResultState *node);
 extern void ExecEndResult(ResultState *node);
 extern void ExecResultMarkPos(ResultState *node);
 extern void ExecResultRestrPos(ResultState *node);
diff --git a/src/include/executor/nodeSamplescan.h b/src/include/executor/nodeSamplescan.h
index c8f03d8..4ab6e5a 100644
--- a/src/include/executor/nodeSamplescan.h
+++ b/src/include/executor/nodeSamplescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SampleScanState *ExecInitSampleScan(SampleScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSampleScan(SampleScanState *node);
+extern void ExecSampleScan(SampleScanState *node);
 extern void ExecEndSampleScan(SampleScanState *node);
 extern void ExecReScanSampleScan(SampleScanState *node);
 
diff --git a/src/include/executor/nodeSeqscan.h b/src/include/executor/nodeSeqscan.h
index f2e61ff..816d1a5 100644
--- a/src/include/executor/nodeSeqscan.h
+++ b/src/include/executor/nodeSeqscan.h
@@ -18,7 +18,7 @@
 #include "nodes/execnodes.h"
 
 extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSeqScan(SeqScanState *node);
+extern void ExecSeqScan(SeqScanState *node);
 extern void ExecEndSeqScan(SeqScanState *node);
 extern void ExecReScanSeqScan(SeqScanState *node);
 
diff --git a/src/include/executor/nodeSetOp.h b/src/include/executor/nodeSetOp.h
index c6e9603..dd88afb 100644
--- a/src/include/executor/nodeSetOp.h
+++ b/src/include/executor/nodeSetOp.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SetOpState *ExecInitSetOp(SetOp *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSetOp(SetOpState *node);
+extern void ExecSetOp(SetOpState *node);
 extern void ExecEndSetOp(SetOpState *node);
 extern void ExecReScanSetOp(SetOpState *node);
 
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index 481065f..f65037d 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSort(SortState *node);
+extern void ExecSort(SortState *node);
 extern void ExecEndSort(SortState *node);
 extern void ExecSortMarkPos(SortState *node);
 extern void ExecSortRestrPos(SortState *node);
diff --git a/src/include/executor/nodeSubqueryscan.h b/src/include/executor/nodeSubqueryscan.h
index 427699b..a3962c7 100644
--- a/src/include/executor/nodeSubqueryscan.h
+++ b/src/include/executor/nodeSubqueryscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SubqueryScanState *ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSubqueryScan(SubqueryScanState *node);
+extern void ExecSubqueryScan(SubqueryScanState *node);
 extern void ExecEndSubqueryScan(SubqueryScanState *node);
 extern void ExecReScanSubqueryScan(SubqueryScanState *node);
 
diff --git a/src/include/executor/nodeTidscan.h b/src/include/executor/nodeTidscan.h
index 76c2a9f..5b7bbfd 100644
--- a/src/include/executor/nodeTidscan.h
+++ b/src/include/executor/nodeTidscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern TidScanState *ExecInitTidScan(TidScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecTidScan(TidScanState *node);
+extern void ExecTidScan(TidScanState *node);
 extern void ExecEndTidScan(TidScanState *node);
 extern void ExecReScanTidScan(TidScanState *node);
 
diff --git a/src/include/executor/nodeUnique.h b/src/include/executor/nodeUnique.h
index aa8491d..b53a553 100644
--- a/src/include/executor/nodeUnique.h
+++ b/src/include/executor/nodeUnique.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern UniqueState *ExecInitUnique(Unique *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecUnique(UniqueState *node);
+extern void ExecUnique(UniqueState *node);
 extern void ExecEndUnique(UniqueState *node);
 extern void ExecReScanUnique(UniqueState *node);
 
diff --git a/src/include/executor/nodeValuesscan.h b/src/include/executor/nodeValuesscan.h
index 026f261..90288fc 100644
--- a/src/include/executor/nodeValuesscan.h
+++ b/src/include/executor/nodeValuesscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern ValuesScanState *ExecInitValuesScan(ValuesScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecValuesScan(ValuesScanState *node);
+extern void ExecValuesScan(ValuesScanState *node);
 extern void ExecEndValuesScan(ValuesScanState *node);
 extern void ExecReScanValuesScan(ValuesScanState *node);
 
diff --git a/src/include/executor/nodeWindowAgg.h b/src/include/executor/nodeWindowAgg.h
index 94ed037..f5e2c98 100644
--- a/src/include/executor/nodeWindowAgg.h
+++ b/src/include/executor/nodeWindowAgg.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern WindowAggState *ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecWindowAgg(WindowAggState *node);
+extern void ExecWindowAgg(WindowAggState *node);
 extern void ExecEndWindowAgg(WindowAggState *node);
 extern void ExecReScanWindowAgg(WindowAggState *node);
 
diff --git a/src/include/executor/nodeWorktablescan.h b/src/include/executor/nodeWorktablescan.h
index 217208a..7b1eecb 100644
--- a/src/include/executor/nodeWorktablescan.h
+++ b/src/include/executor/nodeWorktablescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern WorkTableScanState *ExecInitWorkTableScan(WorkTableScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecWorkTableScan(WorkTableScanState *node);
+extern void ExecWorkTableScan(WorkTableScanState *node);
 extern void ExecEndWorkTableScan(WorkTableScanState *node);
 extern void ExecReScanWorkTableScan(WorkTableScanState *node);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4b18436..ff6c453 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1031,6 +1031,8 @@ typedef struct PlanState
 								 * top-level plan */
 
 	struct PlanState *parent;	/* node which will receive tuples from us */
+	bool		result_ready;	/* true if result is ready */
+	Node	   *result;			/* result, most often TupleTableSlot */
 
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
-- 
1.8.3.1
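
For illustration, here is a minimal sketch of the new calling convention
from a node's point of view.  FooState, FetchNextFooTuple, and
FooQualifies are hypothetical names, not part of the patch; the point is
only that a per-node Exec function now publishes its result through
ExecReturnTuple and returns void, instead of handing a TupleTableSlot
back to ExecProcNode directly.

/* Hypothetical node type; assumes FooState embeds PlanState as "ps". */
void
ExecFoo(FooState *node)
{
	for (;;)
	{
		TupleTableSlot *slot = FetchNextFooTuple(node); /* assumed helper */

		if (TupIsNull(slot))
		{
			/* End of data: publish a NULL result rather than returning it. */
			ExecReturnTuple(&node->ps, NULL);
			return;
		}

		if (FooQualifies(node, slot))	/* assumed qual check */
		{
			/* Sets node->ps.result and node->ps.result_ready. */
			ExecReturnTuple(&node->ps, slot);
			return;
		}
	}
}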

0003-Lightweight-framework-for-waiting-for-events.patch (text/x-patch; charset=us-ascii)
From df659418c127f675121c684eefa80b3dcf26afc2 Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 9 May 2016 11:48:11 -0400
Subject: [PATCH 3/7] Lightweight framework for waiting for events.

---
 src/backend/executor/Makefile       |   4 +-
 src/backend/executor/execAsync.c    | 256 ++++++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c |  82 ++++++++----
 src/include/executor/execAsync.h    |  23 ++++
 src/include/executor/executor.h     |   2 +
 src/include/nodes/execnodes.h       |  10 ++
 6 files changed, 352 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..20601fa
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,256 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * This file contains routines that are intended to support asynchronous
+ * execution; that is, suspending an executor node until some external
+ * event occurs, or until one of its child nodes produces a tuple.
+ * This allows the executor to avoid blocking on a single external event,
+ * such as a file descriptor waiting on I/O, or a parallel worker which
+ * must complete work elsewhere in the plan tree, when there might at the
+ * same time be useful computation that could be accomplished in some
+ * other part of the plan tree.
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/executor.h"
+#include "storage/latch.h"
+
+#define	EVENT_BUFFER_SIZE		16
+
+static void ExecAsyncConfigureWait(PlanState *planstate, bool reinit);
+
+void
+ExecAsyncWaitForNode(PlanState *planstate)
+{
+	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
+	PlanState  *callbacks[EVENT_BUFFER_SIZE];
+	int			ncallbacks = 0;
+	EState *estate = planstate->state;
+
+	while (!planstate->result_ready)
+	{
+		bool	reinit = (estate->es_wait_event_set == NULL);
+		int		n;
+		int		noccurred;
+
+		if (reinit)
+		{
+			/*
+			 * Allow for a few extra events without reinitializing.  It
+			 * doesn't seem worth the complexity of doing anything very
+			 * aggressive here, because plans that depend on massive numbers
+			 * of external FDs are likely to run afoul of kernel limits anyway.
+			 */
+			estate->es_max_async_events = estate->es_total_async_events + 16;
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_max_async_events);
+		}
+
+		/* Give each waiting node a chance to add or modify events. */
+		for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+			ExecAsyncConfigureWait(estate->es_waiting_nodes[n], reinit);
+
+		/* Wait for at least one event to occur. */
+		noccurred = WaitEventSetWait(estate->es_wait_event_set, -1,
+									 occurred_event, EVENT_BUFFER_SIZE);
+		Assert(noccurred > 0);
+
+		/*
+		 * Loop over the occurred events and make a list of nodes that need
+		 * a callback.  The waiting nodes should have registered their wait
+		 * events with user_data pointing back to the node.
+		 */
+		for (n = 0; n < noccurred; ++n)
+		{
+			WaitEvent  *w = &occurred_event[n];
+			PlanState  *ps = w->user_data;
+
+			callbacks[ncallbacks++] = ps;
+		}
+
+		/*
+		 * Initially, this loop will call the node-type-specific function for
+		 * each node for which an event occurred.  If any of those nodes
+		 * produce a result, its parent enters the set of nodes that are
+		 * pending for a callback.  In this way, when a result becomes
+		 * available in a leaf of the plan tree, it can bubble upwards towards
+		 * the root as far as necessary.
+		 */
+		while (ncallbacks > 0)
+		{
+			int		i,
+					j;
+
+			/* Loop over all callbacks. */
+			for (i = 0; i < ncallbacks; ++i)
+			{
+				/* Skip if NULL. */
+				if (callbacks[i] == NULL)
+					continue;
+
+				/*
+				 * Remove any duplicates.  O(n) may not seem good, but it
+				 * should hopefully be OK as long as EVENT_BUFFER_SIZE is
+				 * not too large.
+				 */
+				for (j = i + 1; j < ncallbacks; ++j)
+					if (callbacks[i] == callbacks[j])
+						callbacks[j] = NULL;
+
+				/* Dispatch to node-type-specific code. */
+				ExecDispatchNode(callbacks[i]);
+
+				/*
+				 * If there's now a tuple ready, we must dispatch to the
+				 * parent node; otherwise, there's nothing more to do.
+				 */
+				if (callbacks[i]->result_ready)
+					callbacks[i] = callbacks[i]->parent;
+				else
+					callbacks[i] = NULL;
+			}
+
+			/* Squeeze out NULLs. */
+			for (i = 0, j = 0; j < ncallbacks; ++j)
+				if (callbacks[j] != NULL)
+					callbacks[i++] = callbacks[j];
+			ncallbacks = i;
+		}
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one more or events that can be registered on a WaitEventSet.  nevents
+ * should be the maximum number of events that it will wish to register.
+ * reinit should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncNeedsWait(PlanState *planstate, int nevents, bool reinit)
+{
+	EState *estate = planstate->state;
+
+	Assert(nevents > 0); 	/* otherwise, use ExecAsyncDoesNotNeedWait */
+
+	/*
+	 * If this node is not already present in the array of waiting nodes,
+	 * then add it.  If that array hasn't been allocated or is full, this may
+	 * require (re)allocating it.
+	 */
+	if (planstate->n_async_events == 0)
+	{
+		if (estate->es_max_waiting_nodes >= estate->es_num_waiting_nodes)
+		{
+			int		newmax;
+
+			if (estate->es_max_waiting_nodes == 0)
+			{
+				newmax = 16;
+				estate->es_waiting_nodes =
+					MemoryContextAlloc(estate->es_query_cxt, newmax);
+			}
+			else
+			{
+				newmax = estate->es_max_waiting_nodes * 2;
+				estate->es_waiting_nodes =
+					repalloc(estate->es_waiting_nodes,
+							 newmax * sizeof(PlanState *));
+			}
+			estate->es_max_waiting_nodes = newmax;
+		}
+		estate->es_waiting_nodes[estate->es_num_waiting_nodes++] = planstate;
+	}
+
+	/* Adjust per-node and per-estate totals. */
+	estate->es_total_async_events -= planstate->n_async_events;
+	planstate->n_async_events = nevents;
+	estate->es_total_async_events += planstate->n_async_events;
+
+	/*
+	 * If a WaitEventSet has already been created, we need to discard it and
+	 * start again if the user passed reinit = true, or if the total number of
+	 * required events exceeds the supported number.
+	 */
+	if (estate->es_wait_event_set != NULL && (reinit ||
+		estate->es_total_async_events > estate->es_max_async_events))
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * If an executor node no longer needs to wait, it should call this function
+ * to report that fact.
+ */
+void
+ExecAsyncDoesNotNeedWait(PlanState *planstate)
+{
+	int		n;
+	EState *estate = planstate->state;
+
+	if (planstate->n_async_events <= 0)
+		return;
+
+	/*
+	 * Remove the node from the list of waiting nodes.  (Is a linear search
+	 * going to be a problem here?  I think probably not.)
+	 */
+	for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+	{
+		if (estate->es_waiting_nodes[n] == planstate)
+		{
+			estate->es_waiting_nodes[n] =
+				estate->es_waiting_nodes[--estate->es_num_waiting_nodes];
+			break;
+		}
+	}
+
+	/* We should always find ourselves in the array. */
+	Assert(n < estate->es_num_waiting_nodes);
+
+	/* We no longer need any asynchronous events. */
+	estate->es_total_async_events -= planstate->n_async_events;
+	planstate->n_async_events = 0;
+
+	/*
+	 * The next wait will need to rebuild the WaitEventSet, because whatever
+	 * events we registered are gone now.  It's probably OK that this code
+	 * assumes we actually did register some events at one point, because we
+	 * needed to wait at some point and we don't any more.
+	 */
+	if (estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * Give per-nodetype function a chance to register wait events.
+ */
+static void
+ExecAsyncConfigureWait(PlanState *planstate, bool reinit)
+{
+	switch (nodeTag(planstate))
+	{
+		/* XXX: Add calls to per-nodetype handlers here. */
+		default:
+			elog(ERROR, "unexpected node type: %d", nodeTag(planstate));
+	}
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 3f2ebff..b7ac08e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -77,6 +77,7 @@
  */
 #include "postgres.h"
 
+#include "executor/execAsync.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
 #include "executor/nodeAppend.h"
@@ -368,24 +369,14 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 
 
 /* ----------------------------------------------------------------
- *		ExecProcNode
+ *		ExecDispatchNode
  *
- *		Execute the given node to return a(nother) tuple.
+ *		Invoke the given node's dispatch function.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
-ExecProcNode(PlanState *node)
+void
+ExecDispatchNode(PlanState *node)
 {
-	TupleTableSlot *result;
-
-	CHECK_FOR_INTERRUPTS();
-
-	/* mark any previous result as having been consumed */
-	node->result_ready = false;
-
-	if (node->chgParam != NULL) /* something changed */
-		ExecReScan(node);		/* let ReScan handle this */
-
 	if (node->instrument)
 		InstrStartNode(node->instrument);
 
@@ -539,22 +530,67 @@ ExecProcNode(PlanState *node)
 
 		default:
 			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			result = NULL;
 			break;
 	}
 
-	/* We don't support asynchronous execution yet. */
-	Assert(node->result_ready);
+	if (node->instrument)
+	{
+		double	nTuples = 0.0;
 
-	/* Result should be a TupleTableSlot, unless it's NULL. */
-	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+		if (node->result_ready && node->result != NULL &&
+			IsA(node->result, TupleTableSlot))
+			nTuples = 1.0;
 
-	result = (TupleTableSlot *) node->result;
+		InstrStopNode(node->instrument, nTuples);
+	}
+}
 
-	if (node->instrument)
-		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
-	return result;
+/* ----------------------------------------------------------------
+ *		ExecExecuteNode
+ *
+ *		Request the next tuple from the given node.  Note that
+ *		if the node supports asynchrony, result_ready may not be
+ *		set on return (use ExecProcNode if you need that, or call
+ *		ExecAsyncWaitForNode).
+ * ----------------------------------------------------------------
+ */
+void
+ExecExecuteNode(PlanState *node)
+{
+	node->result_ready = false;
+	ExecDispatchNode(node);
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecProcNode
+ *
+ *		Get the next tuple from the given node.  If the node is
+ *		asynchronous, wait for a tuple to be ready before
+ *		returning.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecProcNode(PlanState *node)
+{
+	CHECK_FOR_INTERRUPTS();
+
+	/* mark any previous result as having been consumed */
+	node->result_ready = false;
+
+	if (node->chgParam != NULL) /* something changed */
+		ExecReScan(node);		/* let ReScan handle this */
+
+	ExecDispatchNode(node);
+
+	if (!node->result_ready)
+		ExecAsyncWaitForNode(node);
+
+	/* Result should be a TupleTableSlot, unless it's NULL. */
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+
+	return (TupleTableSlot *) node->result;
 }
 
 
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..38b37a1
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncWaitForNode(PlanState *planstate);
+extern void ExecAsyncNeedsWait(PlanState *planstate, int nevents,
+	bool reinit);
+extern void ExecAsyncDoesNotNeedWait(PlanState *planstate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 1eb09d8..7abc361 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -223,6 +223,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
  */
 extern PlanState *ExecInitNode(Plan *node, EState *estate, PlanState *parent,
 			 int eflags);
+extern void ExecDispatchNode(PlanState *node);
+extern void ExecExecuteNode(PlanState *node);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ff6c453..76e36a2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -382,6 +382,14 @@ typedef struct EState
 	ParamListInfo es_param_list_info;	/* values of external params */
 	ParamExecData *es_param_exec_vals;	/* values of internal params */
 
+	/* Asynchronous execution support */
+	struct PlanState **es_waiting_nodes;		/* array of waiting nodes */
+	int			es_num_waiting_nodes;	/* # of waiters in array */
+	int			es_max_waiting_nodes;	/* # of allocated entries */
+	int			es_total_async_events;	/* total of per-node n_async_events */
+	int			es_max_async_events;	/* # supported by event set */
+	struct WaitEventSet *es_wait_event_set;
+
 	/* Other working state: */
 	MemoryContext es_query_cxt; /* per-query context in which EState lives */
 
@@ -1034,6 +1042,8 @@ typedef struct PlanState
 	bool		result_ready;	/* true if result is ready */
 	Node	   *result;			/* result, most often TupleTableSlot */
 
+	int			n_async_events;	/* # of async events we want to register */
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
-- 
1.8.3.1
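
To make the intended use of this API concrete, here is a rough sketch of
an async-capable node under the conventions of patch 0003.  FooState,
foo_tuple_available, and foo_fetch_tuple are hypothetical names; a real
node would also need a case added to ExecAsyncConfigureWait's switch (the
"XXX" in the patch) to actually register its wait event.

/* Hypothetical async-capable node; assumes FooState embeds PlanState "ps". */
void
ExecFoo(FooState *node)
{
	if (!foo_tuple_available(node))		/* assumed readiness test */
	{
		/*
		 * Not ready yet: ask for one event (say, a readable socket) to be
		 * registered on the shared WaitEventSet.  Returning with
		 * result_ready still unset makes ExecProcNode call
		 * ExecAsyncWaitForNode, which waits and re-dispatches this node
		 * when its event fires.
		 */
		ExecAsyncNeedsWait(&node->ps, 1, false);
		return;
	}

	/* Data has arrived; we no longer need to be woken up. */
	ExecAsyncDoesNotNeedWait(&node->ps);
	ExecReturnTuple(&node->ps, foo_fetch_tuple(node));
}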

0004-Fix-async-execution-framework.patch (text/x-patch; charset=us-ascii)
From d5c5d799591dacecb65229741676d8bc41229919 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 17:23:16 +0900
Subject: [PATCH 4/7] Fix async execution framework.

This commit changes some behavior of the framework and fixes some
minor bugs.
---
 src/backend/executor/execAsync.c    | 141 +++++++++++++++++++++++++-----------
 src/backend/executor/execProcnode.c |  33 ++++++---
 src/backend/executor/execScan.c     |  33 +++++++--
 src/backend/executor/execUtils.c    |   8 ++
 src/backend/executor/nodeSeqscan.c  |   7 +-
 src/include/executor/execAsync.h    |   7 ++
 src/include/executor/executor.h     |  20 +++--
 src/include/nodes/execnodes.h       |  26 +++++--
 8 files changed, 199 insertions(+), 76 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 20601fa..6da7ef2 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -29,7 +29,7 @@
 
 #define	EVENT_BUFFER_SIZE		16
 
-static void ExecAsyncConfigureWait(PlanState *planstate, bool reinit);
+static bool ExecAsyncConfigureWait(PlanState *planstate, AsyncConfigMode mode);
 
 void
 ExecAsyncWaitForNode(PlanState *planstate)
@@ -37,13 +37,14 @@ ExecAsyncWaitForNode(PlanState *planstate)
 	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
 	PlanState  *callbacks[EVENT_BUFFER_SIZE];
 	int			ncallbacks = 0;
-	EState *estate = planstate->state;
+	EState     *estate = planstate->state;
 
 	while (!planstate->result_ready)
 	{
-		bool	reinit = (estate->es_wait_event_set == NULL);
+		bool	reinit = (estate->wait_event_set == NULL);
 		int		n;
 		int		noccurred;
+		bool	has_event = false;
 
 		if (reinit)
 		{
@@ -53,18 +54,68 @@ ExecAsyncWaitForNode(PlanState *planstate)
 			 * aggressive here, because plans that depend on massive numbers
 			 * of external FDs are likely to run afoul of kernel limits anyway.
 			 */
-			estate->es_max_async_events = estate->es_total_async_events + 16;
-			estate->es_wait_event_set =
-				CreateWaitEventSet(estate->es_query_cxt,
-								   estate->es_max_async_events);
+			estate->max_events = estate->total_events + 16;
+			estate->wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt, estate->max_events);
 		}
 
-		/* Give each waiting node a chance to add or modify events. */
-		for (n = 0; n < estate->es_num_waiting_nodes; ++n)
-			ExecAsyncConfigureWait(estate->es_waiting_nodes[n], reinit);
+		/*
+		 * Give each waiting node a chance to add or modify events to the
+		 * descendants of this planstate.
+		 */
+		for (n = 0; n < estate->num_waiting_nodes; ++n)
+		{
+			PlanState *node = estate->waiting_nodes[n];
+
+			/*
+			 * We assume that few nodes are async-aware, and that async-unaware
+			 * nodes must not be reverse-dispatched from lower nodes that are
+			 * async-aware.  Firing an async node that is not a descendant of
+			 * this planstate would cause exactly such reverse-dispatching to
+			 * async-unaware nodes, which is unexpected behavior for them.
+			 *
+			 * For instance, consider an async-unaware HashJoin(OUTER, INNER)
+			 * where the OUTER is running asynchronously but the HashJoin is
+			 * waiting on the async INNER during inner-hash creation.  If the
+			 * OUTER fires in that case, since someone is waiting on it,
+			 * ExecAsyncWaitForNode would eventually dispatch to the HashJoin,
+			 * which is still in the middle of its own work.
+			 */
+			if (!IsParent(planstate, node))
+				continue;
+
+			has_event |= 
+				ExecAsyncConfigureWait(node,
+					   reinit ? ASYNCCONF_TRY_ADD : ASYNCCONF_MODIFY);
+		}
+
+		if (!has_event)
+		{
+			/*
+			 * No event to wait on.  This occurs when all of the waiters share
+			 * their synchronization object with nodes in some other sync
+			 * subtree.  Either way, we must have at least one event to wait on.
+			 */
+
+			 for (n = 0; n < estate->num_waiting_nodes; ++n)
+			 {
+				 PlanState *node = estate->waiting_nodes[n];
 
-		/* Wait for at least one event to occur. */
-		noccurred = WaitEventSetWait(estate->es_wait_event_set, -1,
+				 /* Skip if this node is not a descendant of planstate */
+				 if (!IsParent(planstate, node))
+					 continue;
+
+				 if (ExecAsyncConfigureWait(node, ASYNCCONF_FORCE_ADD))
+					 break;
+			 }
+
+			 /* We found nothing to wait on; something is wrong. */
+			 if (n == estate->num_waiting_nodes)
+				 ereport(ERROR,
+						 (errmsg("inconsistency in asynchronous execution")));
+		}
+
+		noccurred = WaitEventSetWait(estate->wait_event_set, -1,
 									 occurred_event, EVENT_BUFFER_SIZE);
 		Assert(noccurred > 0);
 
@@ -115,9 +166,10 @@ ExecAsyncWaitForNode(PlanState *planstate)
 
 				/*
 				 * If there's now a tuple ready, we must dispatch to the
-				 * parent node; otherwise, there's nothing more to do.
+				 * parent node up to the waiting root; otherwise, there's
+				 * nothing more to do.
 				 */
-				if (callbacks[i]->result_ready)
+				if (callbacks[i]->result_ready && callbacks[i] != planstate)
 					callbacks[i] = callbacks[i]->parent;
 				else
 					callbacks[i] = NULL;
@@ -143,7 +195,7 @@ ExecAsyncWaitForNode(PlanState *planstate)
 void
 ExecAsyncNeedsWait(PlanState *planstate, int nevents, bool reinit)
 {
-	EState *estate = planstate->state;
+	EState     *estate = planstate->state;
 
 	Assert(nevents > 0); 	/* otherwise, use ExecAsyncDoesNotNeedWait */
 
@@ -154,43 +206,45 @@ ExecAsyncNeedsWait(PlanState *planstate, int nevents, bool reinit)
 	 */
 	if (planstate->n_async_events == 0)
 	{
-		if (estate->es_max_waiting_nodes >= estate->es_num_waiting_nodes)
+		if (estate->max_waiting_nodes <= estate->num_waiting_nodes)
 		{
 			int		newmax;
 
-			if (estate->es_max_waiting_nodes == 0)
+			if (estate->max_waiting_nodes == 0)
 			{
 				newmax = 16;
-				estate->es_waiting_nodes =
-					MemoryContextAlloc(estate->es_query_cxt, newmax);
+				estate->waiting_nodes =
+					MemoryContextAlloc(estate->es_query_cxt,
+									   newmax * sizeof(PlanState *));
 			}
 			else
 			{
-				newmax = estate->es_max_waiting_nodes * 2;
-				estate->es_waiting_nodes =
-					repalloc(estate->es_waiting_nodes,
+				newmax = estate->max_waiting_nodes * 2;
+				estate->waiting_nodes =
+					repalloc(estate->waiting_nodes,
 							 newmax * sizeof(PlanState *));
 			}
-			estate->es_max_waiting_nodes = newmax;
+			estate->max_waiting_nodes = newmax;
 		}
-		estate->es_waiting_nodes[estate->es_num_waiting_nodes++] = planstate;
+		estate->waiting_nodes[estate->num_waiting_nodes++] =
+			planstate;
 	}
 
-	/* Adjust per-node and per-estate totals. */
-	estate->es_total_async_events -= planstate->n_async_events;
+	/* Adjust per-node and per-estate totals. */
+	estate->total_events -= planstate->n_async_events;
 	planstate->n_async_events = nevents;
-	estate->es_total_async_events += planstate->n_async_events;
+	estate->total_events += planstate->n_async_events;
 
 	/*
 	 * If a WaitEventSet has already been created, we need to discard it and
 	 * start again if the user passed reinit = true, or if the total number of
 	 * required events exceeds the supported number.
 	 */
-	if (estate->es_wait_event_set != NULL && (reinit ||
-		estate->es_total_async_events > estate->es_max_async_events))
+	if (estate->wait_event_set != NULL && (reinit ||
+		estate->total_events > estate->max_events))
 	{
-		FreeWaitEventSet(estate->es_wait_event_set);
-		estate->es_wait_event_set = NULL;
+		FreeWaitEventSet(estate->wait_event_set);
+		estate->wait_event_set = NULL;
 	}
 }
 
@@ -211,21 +265,20 @@ ExecAsyncDoesNotNeedWait(PlanState *planstate)
 	 * Remove the node from the list of waiting nodes.  (Is a linear search
 	 * going to be a problem here?  I think probably not.)
 	 */
-	for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+	for (n = 0; n < estate->num_waiting_nodes; ++n)
 	{
-		if (estate->es_waiting_nodes[n] == planstate)
-		{
-			estate->es_waiting_nodes[n] =
-				estate->es_waiting_nodes[--estate->es_num_waiting_nodes];
+		if (estate->waiting_nodes[n] == planstate)
 			break;
-		}
 	}
 
 	/* We should always find ourselves in the array. */
-	Assert(n < estate->es_num_waiting_nodes);
+	Assert(n < estate->num_waiting_nodes);
+
+	estate->waiting_nodes[n] =
+		estate->waiting_nodes[--estate->num_waiting_nodes];
 
 	/* We no longer need any asynchronous events. */
-	estate->es_total_async_events -= planstate->n_async_events;
+	estate->total_events -= planstate->n_async_events;
 	planstate->n_async_events = 0;
 
 	/*
@@ -234,18 +287,18 @@ ExecAsyncDoesNotNeedWait(PlanState *planstate)
 	 * assumes we actually did register some events at one point, because we
 	 * needed to wait at some point and we don't any more.
 	 */
-	if (estate->es_wait_event_set != NULL)
+	if (estate->wait_event_set != NULL)
 	{
-		FreeWaitEventSet(estate->es_wait_event_set);
-		estate->es_wait_event_set = NULL;
+		FreeWaitEventSet(estate->wait_event_set);
+		estate->wait_event_set = NULL;
 	}
 }
 
 /*
  * Give per-nodetype function a chance to register wait events.
  */
-static void
-ExecAsyncConfigureWait(PlanState *planstate, bool reinit)
+static bool
+ExecAsyncConfigureWait(PlanState *planstate, AsyncConfigMode config_mode)
 {
 	switch (nodeTag(planstate))
 	{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index b7ac08e..3590ab1 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -139,6 +139,7 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 	PlanState  *result;
 	List	   *subps;
 	ListCell   *l;
+	int			this_node_id = estate->next_node_id++;
 
 	/*
 	 * do nothing when we get to the end of a leaf on tree.
@@ -344,6 +345,10 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 	/* Set parent pointer. */
 	result->parent = parent;
 
+	/* Set this node's id and that of its right sibling */
+	result->node_id = this_node_id;
+	result->right_node_id = estate->next_node_id;
+
 	/*
 	 * Initialize any initPlans present in this node.  The planner put them in
 	 * a separate list for us.
@@ -374,9 +379,13 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
  *		Invoke the given node's dispatch function.
  * ----------------------------------------------------------------
  */
-void
+
+inline void
 ExecDispatchNode(PlanState *node)
 {
+	if (node->result_ready)
+		return;
+
 	if (node->instrument)
 		InstrStartNode(node->instrument);
 
@@ -559,6 +568,8 @@ void
 ExecExecuteNode(PlanState *node)
 {
 	node->result_ready = false;
+	if (node->chgParam != NULL) /* something changed */
+		ExecReScan(node);		/* let ReScan handle this */
 	ExecDispatchNode(node);
 }
 
@@ -569,15 +580,18 @@ ExecExecuteNode(PlanState *node)
  *		Get the next tuple from the given node.  If the node is
  *		asynchronous, wait for a tuple to be ready before
  *		returning.
- * ----------------------------------------------------------------
+ *		The given node acts as the termination node of an asynchronous
+ *		execution subtree; every such subtree should have its own context.
+ * ----------------------------------------------------------------
  */
 TupleTableSlot *
 ExecProcNode(PlanState *node)
 {
 	CHECK_FOR_INTERRUPTS();
 
-	/* mark any previous result as having been consumed */
-	node->result_ready = false;
+	/* Return unconsumed result if any */
+	if (node->result_ready)
+		return ExecConsumeResult(node);
 
 	if (node->chgParam != NULL) /* something changed */
 		ExecReScan(node);		/* let ReScan handle this */
@@ -587,10 +601,7 @@ ExecProcNode(PlanState *node)
 	if (!node->result_ready)
 		ExecAsyncWaitForNode(node);
 
-	/* Result should be a TupleTableSlot, unless it's NULL. */
-	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
-
-	return (TupleTableSlot *) node->result;
+	return ExecConsumeResult(node);
 }
 
 
@@ -848,6 +859,8 @@ ExecEndNode(PlanState *node)
 bool
 ExecShutdownNode(PlanState *node)
 {
+	bool ret;
+
 	if (node == NULL)
 		return false;
 
@@ -860,5 +873,7 @@ ExecShutdownNode(PlanState *node)
 			break;
 	}
 
-	return planstate_tree_walker(node, ExecShutdownNode, NULL);
+	ret = planstate_tree_walker(node, ExecShutdownNode, NULL);
+
+	return ret;
 }
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 095d40b..69d616b 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -128,6 +128,9 @@ ExecScan(ScanState *node,
 	ExprDoneCond isDone;
 	TupleTableSlot *resultSlot;
 
+	if (node->ps.result_ready)
+		return;
+
 	/*
 	 * Fetch data from node
 	 */
@@ -136,14 +139,25 @@ ExecScan(ScanState *node,
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * The underlying nodes don't use ExecReturnTuple. Set this flag here so
+	 * that the async-unaware/incapable children don't need to touch it
+	 * explicitly. Async-aware/capable nodes will unset it instead if needed.
+	 */
+	node->ps.result_ready = true;
+
+	/*
 	 * If we have neither a qual to check nor a projection to do, just skip
 	 * all the overhead and produce the raw scan tuple.
 	 */
 	if (!qual && !projInfo)
 	{
+		TupleTableSlot *slot;
+
 		ResetExprContext(econtext);
-		ExecReturnTuple(&node->ps,
-						ExecScanFetch(node, accessMtd, recheckMtd));
+		slot = ExecScanFetch(node, accessMtd, recheckMtd);
+		if (node->ps.result_ready)
+			node->ps.result = (Node *) slot;
+
 		return;
 	}
 
@@ -158,7 +172,7 @@ ExecScan(ScanState *node,
 		resultSlot = ExecProject(projInfo, &isDone);
 		if (isDone == ExprMultipleResult)
 		{
-			ExecReturnTuple(&node->ps, resultSlot);
+			node->ps.result = (Node *) resultSlot;
 			return;
 		}
 		/* Done with that source tuple... */
@@ -184,6 +198,9 @@ ExecScan(ScanState *node,
 
 		slot = ExecScanFetch(node, accessMtd, recheckMtd);
 
+		if (!node->ps.result_ready)
+			return;
+
 		/*
 		 * if the slot returned by the accessMtd contains NULL, then it means
 		 * there is nothing more to scan so we just return an empty slot,
@@ -193,9 +210,9 @@ ExecScan(ScanState *node,
 		if (TupIsNull(slot))
 		{
 			if (projInfo)
-				ExecReturnTuple(&node->ps, ExecClearTuple(projInfo->pi_slot));
-			else
-				ExecReturnTuple(&node->ps, slot);
+				slot = ExecClearTuple(projInfo->pi_slot);
+
+			node->ps.result = (Node *) slot;
 			return;
 		}
 
@@ -227,7 +244,7 @@ ExecScan(ScanState *node,
 				if (isDone != ExprEndResult)
 				{
 					node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-					ExecReturnTuple(&node->ps, resultSlot);
+					node->ps.result = (Node *) resultSlot;
 					return;
 				}
 			}
@@ -236,7 +253,7 @@ ExecScan(ScanState *node,
 				/*
 				 * Here, we aren't projecting, so just return scan tuple.
 				 */
-				ExecReturnTuple(&node->ps, slot);
+				node->ps.result = (Node *) slot;
 				return;
 			}
 		}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index e937cf8..bb90844 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -117,6 +117,14 @@ CreateExecutorState(void)
 	estate->es_param_list_info = NULL;
 	estate->es_param_exec_vals = NULL;
 
+	estate->waiting_nodes = NULL;
+	estate->num_waiting_nodes = 0;
+	estate->max_waiting_nodes = 0;
+	estate->total_events = 0;
+	estate->max_events = 0;
+	estate->wait_event_set = NULL;
+	estate->next_node_id = 1;
+
 	estate->es_query_cxt = qcontext;
 
 	estate->es_tupleTable = NIL;
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 0ca86d9..ef1ce9c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -124,9 +124,10 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
 void
 ExecSeqScan(SeqScanState *node)
 {
-	return ExecScan((ScanState *) node,
-					(ExecScanAccessMtd) SeqNext,
-					(ExecScanRecheckMtd) SeqRecheck);
+	ExecScan((ScanState *) node,
+			 (ExecScanAccessMtd) SeqNext,
+			 (ExecScanRecheckMtd) SeqRecheck);
+
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index 38b37a1..f1c748b 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -15,6 +15,13 @@
 
 #include "nodes/execnodes.h"
 
+typedef enum AsyncConfigMode
+{
+	ASYNCCONF_MODIFY,
+	ASYNCCONF_TRY_ADD,
+	ASYNCCONF_FORCE_ADD
+} AsyncConfigMode;
+
 extern void ExecAsyncWaitForNode(PlanState *planstate);
 extern void ExecAsyncNeedsWait(PlanState *planstate, int nevents,
 	bool reinit);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7abc361..c1ef2ab 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -231,14 +231,22 @@ extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
 /* Convenience function to set a node's result to a TupleTableSlot. */
-static inline void
-ExecReturnTuple(PlanState *node, TupleTableSlot *slot)
-{
-	Assert(!node->result_ready);
-	node->result = (Node *) slot;
-	node->result_ready = true;
+#define ExecReturnTuple(node, slot) \
+{ \
+	Assert(!(node)->result_ready);	\
+	(node)->result = (Node *) (slot);	\
+	(node)->result_ready = true; \
 }
 
+/* Convenience function to retrieve a node's result. */
+#define ExecConsumeResult(node) \
+( \
+    Assert((node)->result_ready), \
+    Assert((node)->result == NULL || IsA((node)->result, TupleTableSlot)), \
+    (node)->result_ready = false, \
+	(TupleTableSlot *) (node)->result)
+
+
 /*
  * prototypes from functions in execQual.c
  */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 76e36a2..b72decc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -383,12 +383,14 @@ typedef struct EState
 	ParamExecData *es_param_exec_vals;	/* values of internal params */
 
 	/* Asynchronous execution support */
-	struct PlanState **es_waiting_nodes;		/* array of waiting nodes */
-	int			es_num_waiting_nodes;	/* # of waiters in array */
-	int			es_max_waiting_nodes;	/* # of allocated entries */
-	int			es_total_async_events;	/* total of per-node n_async_events */
-	int			es_max_async_events;	/* # supported by event set */
-	struct WaitEventSet *es_wait_event_set;
+	struct PlanState **waiting_nodes;	/* array of waiting nodes */
+	int			num_waiting_nodes;		/* # of waiters in array */
+	int			max_waiting_nodes;		/* # of allocated entries */
+	int			total_events;			/* total of per-node n_async_events */
+	int			max_events;				/* # supported by event set */
+	struct WaitEventSet *wait_event_set;
+
+	int			next_node_id;			/* node id for the next plan state */
 
 	/* Other working state: */
 	MemoryContext es_query_cxt; /* per-query context in which EState lives */
@@ -1038,6 +1040,15 @@ typedef struct PlanState
 								 * nodes point to one EState for the whole
 								 * top-level plan */
 
+	/*
+	 * node_id and right_node_id represent the ancestor-descendant
+	 * relationship using the nested-set model. The ids are assigned in
+	 * depth-first order, and the ids of all descendants of a node lie
+	 * between that node's node_id and right_node_id - 1.
+	 */
+	int			node_id;		/* node id according to nested set model */
+	int			right_node_id;	/* node id of the right sibling */
+
 	struct PlanState *parent;	/* node which will receive tuples from us */
 	bool		result_ready;	/* true if result is ready */
 	Node	   *result;			/* result, most often TupleTableSlot */
@@ -1075,6 +1086,9 @@ typedef struct PlanState
 								 * functions in targetlist */
 } PlanState;
 
+/* Macros applied on PlanStates */
+#define IsParent(p, d) ((p)->node_id <= (d)->node_id && (d)->node_id < (p)->right_node_id)
+
 /* ----------------
  *	these are defined to avoid confusion problems with "left"
  *	and "right" and "inner" and "outer".  The convention is that
-- 
1.8.3.1

0005-Add-new-fdwroutine-AsyncConfigureWait-and-ShutdownFo.patch (text/x-patch; charset=us-ascii)
From 13024173b30e46eba18496dcfd59308fc6aa7837 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 17:25:30 +0900
Subject: [PATCH 5/7] Add new fdwroutine AsyncConfigureWait and
 ShutdownForeignScan.

Async-capable nodes should handle AsyncConfigureWait and
ExecShutdownNode callbacks. This patch adds entries for FDWs in the
two functions and adds corresponding FdwRoutine entries.
---
 src/backend/executor/execAsync.c    | 14 ++++++++++++--
 src/backend/executor/execProcnode.c |  9 +++++++++
 src/include/foreign/fdwapi.h        |  8 ++++++++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 6da7ef2..578d70f 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -25,6 +25,7 @@
 
 #include "executor/execAsync.h"
 #include "executor/executor.h"
+#include "foreign/fdwapi.h"
 #include "storage/latch.h"
 
 #define	EVENT_BUFFER_SIZE		16
@@ -302,8 +303,17 @@ ExecAsyncConfigureWait(PlanState *planstate, AsyncConfigMode config_mode)
 {
 	switch (nodeTag(planstate))
 	{
-		/* XXX: Add calls to per-nodetype handlers here. */
-		default:
+		/* Add calls to per-nodetype handlers here. */
+		case T_ForeignScanState:
+			{
+				ForeignScanState *node = (ForeignScanState *) planstate;
+				if (node->fdwroutine->AsyncConfigureWait)
+					return node->fdwroutine->AsyncConfigureWait(node, config_mode);
+			}
+			break;
+		default:
 			elog(ERROR, "unexpected node type: %d", nodeTag(planstate));
 	}
+
+	return false;
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 3590ab1..cef262b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -115,6 +115,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -869,6 +870,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+			{
+				ForeignScanState *fsstate = (ForeignScanState *)node;
+				FdwRoutine *fdwroutine = fsstate->fdwroutine;
+				if (fdwroutine->ShutdownForeignScan)
+					fdwroutine->ShutdownForeignScan((ForeignScanState *) node);
+			}
+			break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..8de44dd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "executor/execAsync.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
 
@@ -154,6 +155,9 @@ typedef void (*InitializeWorkerForeignScan_function) (ForeignScanState *node,
 typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
+typedef bool (*AsyncConfigureWait_function) (ForeignScanState *node,
+											 AsyncConfigMode config_mode);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -224,6 +228,10 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	AsyncConfigureWait_function AsyncConfigureWait;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
1.8.3.1

0006-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From adc4d813c48e88f2fc60bfc2f05e766abc777114 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 16:15:23 +0900
Subject: [PATCH 6/7] Make postgres_fdw async-capable

It sends the next FETCH just after the previous result is received and
returns !result_ready to the caller. This reduces the time spent
waiting for the result of every FETCH command. Multiple nodes on the
same connection are properly arbitrated.
---
 contrib/postgres_fdw/connection.c              |  81 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  34 +-
 contrib/postgres_fdw/postgres_fdw.c            | 511 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   4 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 5 files changed, 521 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 8ca1c1c..0665d54 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -48,6 +48,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -63,6 +64,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -74,31 +76,17 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
-
+	
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
 	{
@@ -121,11 +109,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -138,8 +123,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+	
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -176,6 +192,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user. Allocate it
+ * with initsize if it does not exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d7747cc..7d94b5b 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5538,27 +5538,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 931bcfd..923a19e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -50,6 +51,8 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve the PgFdwScanState struct from a ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -119,10 +122,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -133,7 +153,6 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -149,6 +168,12 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+									 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -162,11 +187,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -189,6 +214,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -199,7 +225,6 @@ typedef struct PgFdwDirectModifyState
 	bool		set_processed;	/* do we set the command es_processed? */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the update */
 	int			numParams;		/* number of parameters passed to query */
 	FmgrInfo   *param_flinfo;	/* output conversion functions for them */
 	List	   *param_exprs;	/* executable expressions for param values */
@@ -219,6 +244,7 @@ typedef struct PgFdwDirectModifyState
  */
 typedef struct PgFdwAnalyzeState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by query */
@@ -287,6 +313,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -343,6 +370,8 @@ static void postgresGetForeignJoinPaths(PlannerInfo *root,
 							JoinPathExtraData *extra);
 static bool postgresRecheckForeignScan(ForeignScanState *node,
 						   TupleTableSlot *slot);
+static bool postgresAsyncConfigureWait(ForeignScanState *node,
+									   AsyncConfigMode mode);
 
 /*
  * Helper functions
@@ -363,7 +392,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwconn);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -424,6 +456,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -455,6 +488,9 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for asynchronous execution */
+	routine->AsyncConfigureWait = postgresAsyncConfigureWait;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1298,12 +1334,20 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1363,27 +1407,122 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+	
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0);
+				if (!(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					node->ss.ps.result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}					
+
+			Assert(fsstate->async_waiting);
+
+			ExecAsyncDoesNotNeedWait((PlanState *) node);
+			fsstate->async_waiting = false;
+
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting for this node on the same connection, let
+			 * the first waiter be the next owner of this connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Someone else owns this connection. Add myself to the tail
+			 * of the waiters' list, then return not-ready.  To avoid
+			 * scanning through the waiters' list, the current owner
+			 * maintains a shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			ExecAsyncNeedsWait((PlanState *) node, 1, false);
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			node->ss.ps.result_ready = false;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+			{
+				ExecAsyncNeedsWait((PlanState *) next_conn_owner, 1, false);
+				next_owner_state->async_waiting = true;
+			}
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			node->ss.ps.result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
@@ -1397,6 +1536,73 @@ postgresIterateForeignScan(ForeignScanState *node)
 	return slot;
 }
 
+
+static bool
+postgresAsyncConfigureWait(ForeignScanState *node, AsyncConfigMode mode)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	EState *estate = node->ss.ps.state;
+
+	if ((mode == ASYNCCONF_TRY_ADD || mode == ASYNCCONF_FORCE_ADD) &&
+		fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(estate->wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, node);
+		return true;
+	}
+
+	if (mode == ASYNCCONF_FORCE_ADD && fsstate->s.connspec->current_owner)
+	{
+		/*
+		 * We must somehow set a wait event. This occurs when the connection
+		 * owner does not reside in the current waiters' list. In that case,
+		 * forcibly make the connection owner finish the current request and
+		 * usurp the connection.
+		 */
+		ForeignScanState *owner = fsstate->s.connspec->current_owner;
+		PgFdwScanState *owner_state = GetPgFdwScanState(owner);
+		ForeignScanState *prev_waiter, *node_tmp;
+
+		fetch_received_data(owner);
+
+		/* find myself in the waiters' list */
+		prev_waiter = owner;
+
+		while (GetPgFdwScanState(prev_waiter)->waiter != node)
+			prev_waiter = GetPgFdwScanState(prev_waiter)->waiter;
+
+		/* Swap the previous owner and this node */
+		node_tmp = fsstate->waiter;
+
+		if (owner_state->waiter == node)
+			node_tmp = owner;
+		else
+		{
+			node_tmp = owner_state->waiter;
+			GetPgFdwScanState(prev_waiter)->waiter = owner;
+		}
+
+		owner_state->waiter = fsstate->waiter;
+		fsstate->waiter = node_tmp;
+
+		if (owner_state->last_waiter == node)
+			fsstate->last_waiter = prev_waiter;
+		else
+			fsstate->last_waiter = owner_state->last_waiter;
+		
+		request_more_data(node);
+		
+		/* now I am the connection owner */
+		AddWaitEventToSet(estate->wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, node);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * postgresReScanForeignScan
  *		Restart the scan.
@@ -1404,7 +1610,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1412,6 +1618,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1440,9 +1649,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1460,7 +1669,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1468,16 +1677,39 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Remove asynchrony stuff and clean up garbage on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	if (fsstate->async_waiting)
+	{
+		ExecAsyncDoesNotNeedWait((PlanState *) node);
+		fsstate->async_waiting = false;
+	}
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1679,7 +1911,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1760,6 +1994,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1770,14 +2006,15 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1785,10 +2022,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1826,6 +2063,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1846,14 +2085,15 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1861,10 +2101,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1902,6 +2142,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1922,14 +2164,15 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1937,10 +2180,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1987,16 +2230,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2276,7 +2519,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2331,7 +2576,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2378,8 +2626,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2497,6 +2745,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2539,6 +2788,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+		
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2816,11 +3075,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2886,47 +3145,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Read the result of the FETCH previously sent on this node's cursor.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while(fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -2936,26 +3244,81 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate a connection so that this node can send the next query.
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	Assert(!fsstate->async_waiting);
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while(PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connspec->current_owner = NULL;
+	}
+}
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3040,7 +3403,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3050,12 +3413,13 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3063,9 +3427,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3196,9 +3560,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn,
+						   false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3206,10 +3571,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4394,7 +4759,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 67126bc..b0c1266 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,7 +79,8 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
-
+	bool		allow_prefetch;	/* true to allow overlapped fetching  */
+	
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
 	 * relations but is set for all relations. For join relation, the name
@@ -100,6 +101,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 6f684a1..c6bbd3d 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1237,8 +1237,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
1.8.3.1

0007-Make-Append-node-async-aware.patch (text/x-patch; charset=us-ascii)
From b35a3049f1993f4c1521ad38afe7ea35068ab12d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 18:52:37 +0900
Subject: [PATCH 7/7] Make Append node async-aware.

Change the Append node to be capable of handling asynchronous children
properly. As soon as it receives !result_ready from a child, it moves
to the next child, and if no child is ready, it sleeps until at least
one of them becomes ready.
---
 src/backend/executor/nodeAppend.c | 67 +++++++++++++++++++++++++++++++++++++--
 src/include/nodes/execnodes.h     |  2 ++
 2 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index e0ce8c6..9a4063a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -121,6 +122,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 {
 	AppendState *appendstate = makeNode(AppendState);
 	PlanState **appendplanstates;
+	bool	   *finished;
 	int			nplans;
 	int			i;
 	ListCell   *lc;
@@ -134,14 +136,17 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	nplans = list_length(node->appendplans);
 
 	appendplanstates = (PlanState **) palloc0(nplans * sizeof(PlanState *));
-
+	finished = (bool *) palloc0(nplans * sizeof(bool));
+	
 	/*
 	 * create new AppendState for our append node
 	 */
 	appendstate->ps.plan = (Plan *) node;
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
+	appendstate->as_finished = finished;
 	appendstate->as_nplans = nplans;
+	appendstate->as_async = ((eflags & EXEC_FLAG_BACKWARD) == 0);
 
 	/*
 	 * Miscellaneous initialization
@@ -194,6 +199,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 void
 ExecAppend(AppendState *node)
 {
+	int stopplan = node->as_whichplan;
+
 	for (;;)
 	{
 		PlanState  *subnode;
@@ -207,7 +214,36 @@ ExecAppend(AppendState *node)
 		/*
 		 * get a tuple from the subplan
 		 */
-		result = ExecProcNode(subnode);
+		Assert(!node->as_finished[node->as_whichplan]);
+
+		if (!subnode->result_ready)
+			ExecExecuteNode(subnode);
+
+		if (!subnode->result_ready)
+		{
+			if (node->as_async)
+			{
+				/* Move to the next living node */
+				do
+				{
+					node->as_whichplan = 
+						(node->as_whichplan + 1) %  node->as_nplans;
+				} while (node->as_whichplan != stopplan &&
+						 node->as_finished[node->as_whichplan]);
+
+				/* No node is ready yet, return as not-ready */
+				if (node->as_whichplan == stopplan)
+					return;
+
+				/* Try the next node */
+				continue;
+			}
+
+			/* If not async, immediately wait for this subnode */
+			ExecAsyncWaitForNode(subnode);
+		}				
+
+		result = ExecConsumeResult((PlanState *) subnode);
 
 		if (!TupIsNull(result))
 		{
@@ -220,6 +256,31 @@ ExecAppend(AppendState *node)
 			return;
 		}
 
+		if (node->as_async)
+		{
+			node->as_finished[node->as_whichplan] = true;
+			stopplan = node->as_whichplan;
+
+			/* Find the next living subnode */
+			do
+			{
+				node->as_whichplan =
+					(node->as_whichplan + 1) % node->as_nplans;
+			} while (node->as_whichplan != stopplan &&
+					 node->as_finished[node->as_whichplan]);
+
+			if (node->as_whichplan != stopplan)
+			{
+				stopplan = node->as_whichplan;
+				continue;
+			}
+
+			/* All subnodes are exhausted. Finish this node. */
+			ExecReturnTuple(&node->ps,
+							ExecClearTuple(node->ps.ps_ResultTupleSlot));
+			return;
+		}
+
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
@@ -277,6 +338,8 @@ ExecReScanAppend(AppendState *node)
 	{
 		PlanState  *subnode = node->appendplans[i];
 
+		node->as_finished[i] = false;
+
 		/*
 		 * ExecReScan doesn't know about my subplans, so I have to do
 		 * changed-parameter signaling myself.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b72decc..b0a86c5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1177,6 +1177,8 @@ typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
+	bool		as_async;		/* true to allow async execution */
+	bool	   *as_finished;	/* array of the running state of subplans */
 	int			as_nplans;
 	int			as_whichplan;
 } AppendState;
-- 
1.8.3.1

#50Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#49)
1 attachment(s)
Re: asynchronous and vectorized execution

The previous patch set doesn't accept --enable-cassert. The
attached additional one fixes it. It theoretically won't give
degradation but I'll measure the performance change.

At Thu, 21 Jul 2016 18:50:07 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160721.185007.268388411.horiguchi.kyotaro@lab.ntt.co.jp>

Hello,

At Tue, 12 Jul 2016 11:42:55 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160712.114255.156540680.horiguchi.kyotaro@lab.ntt.co.jp>
After some refactoring, the degradation for a simple seqscan is
reduced to 1.4%, and that of "Append(SeqScan())" is reduced to
1.7%. The gains are the same as in the previous measurement. The
scale has been changed from the previous measurement in some test
cases.

t0  (SeqScan()) (2 parallel)
pl  (Append(4 * SeqScan()))
pf0 (Append(4 * ForeignScan())) all ForeignScans are on the same connection.
pf1 (Append(4 * ForeignScan())) all ForeignScans have their own connections.

              time(ms)  stddev(ms)  gain from unpatched (%)
patched-O2
t0             4121.27        1.1                     -1.44
pl             1757.41        0.94                    -1.73
pf0            6458.99      192.4                     20.26
pf1            1747.4        24.81                    78.39

unpatched-O2
t0             4062.6         1.95
pl             1727.45        9.41
pf0            8100.47       24.51
pf1            8086.52       33.53
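
("gain" here is apparently (unpatched - patched) / unpatched * 100; for
example, pf1: (8086.52 - 1747.4) / 8086.52 = 78.39, and t0:
(4062.6 - 4121.27) / 4062.6 = -1.44.)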

In addition to the above, I will try a reentrant ExecAsyncWaitForNode
or something similar.

After some consideration, I found that ExecAsyncWaitForNode
cannot be reentrant, because that would mean control passes into
async-unaware nodes while not-ready nodes still exist, which is an
inconsistent state. To inhibit such reentry, I allocated node
identifiers in depth-first order so that the ancestor-descendant
relationship can be checked in a simple way (the nested-set model),
and call ExecAsyncConfigureWait only for the descendant nodes of the
parameter planstate.
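
For clarity, a minimal sketch of that nested-set check (the patch
implements it as the IsParent() macro on PlanState; the function name
below is only illustrative):

    /*
     * Node ids are assigned in depth-first (preorder) fashion, and
     * right_node_id is the first id beyond the node's subtree, so
     * every descendant d of p falls into the half-open interval
     * [p->node_id, p->right_node_id).
     */
    static bool
    is_descendant(PlanState *p, PlanState *d)
    {
        return p->node_id <= d->node_id && d->node_id < p->right_node_id;
    }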

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0008-Change-two-macros-into-inline-functions.patch (text/x-patch; charset=us-ascii)
From 2a6a95a038948a7a4384f44ef99a9a454175a47c Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Fri, 22 Jul 2016 17:07:34 +0900
Subject: [PATCH 8/8] Change two macros into inline functions.

ExecConsumeResult cannot contain an Assert in macro form. So this
patch turns it into an inline function. ExecReturnTuple is also
changed, for uniformity. This might reduce performance (it
theoretically shouldn't happen, but I believe I saw it..)
---
 src/include/executor/executor.h | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c1ef2ab..8e55b54 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -231,20 +231,25 @@ extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
 /* Convenience function to set a node's result to a TupleTableSlot. */
-#define ExecReturnTuple(node, slot) \
-{ \
-	Assert(!(node)->result_ready);	\
-	(node)->result = (Node *) (slot);	\
-	(node)->result_ready = true; \
+static inline void ExecReturnTuple(PlanState *node, TupleTableSlot *slot);
+static inline void
+ExecReturnTuple(PlanState *node, TupleTableSlot *slot)
+{
+	Assert(!(node)->result_ready);
+	(node)->result = (Node *) (slot);
+	(node)->result_ready = true;
 }
 
 /* Convenience function to retrieve a node's result. */
-#define ExecConsumeResult(node) \
-( \
-    Assert((node)->result_ready), \
-    Assert((node)->result == NULL || IsA((node)->result, TupleTableSlot)), \
-    (node)->result_ready = false, \
-	(TupleTableSlot *) node->result)
+static inline TupleTableSlot *ExecConsumeResult(PlanState *node);
+static inline TupleTableSlot *
+ExecConsumeResult(PlanState *node)
+{
+	Assert(node->result_ready);
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+	node->result_ready = false;
+	return (TupleTableSlot *) node->result;
+}
 
 
 /*
-- 
1.8.3.1

#51Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Kyotaro HORIGUCHI (#49)
Re: asynchronous and vectorized execution

On 21 July 2016 at 15:20, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

After some consideration, I found that ExecAsyncWaitForNode
cannot be reentrant because it means that the control goes into
async-unaware nodes while having not-ready nodes, that is
inconsistent state. To inhibit such reentering, I allocated node
identifiers in depth-first order so that ascendant-descendant
relationship can be checked (nested-set model) in simple way and
call ExecAsyncConfigureWait only for the descendant nodes of the
parameter planstate.

We have estate->waiting_nodes containing a mix of async-aware and
non-async-aware nodes. I was thinking, an asynchrony tree would have only
async-aware nodes, with possibly multiple asynchrony sub-trees in a tree.
Somehow, if we restrict the bubbling up of events only up to the root of the
asynchrony subtree, do you think we can simplify some of the complexities?
For example, ExecAsyncWaitForNode() has become a bit complex, seemingly
because it has to handle non-async nodes also, and that's the reason I
believe you have introduced modes such as ASYNCCONF_FORCE_ADD.


regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#52Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Amit Khandekar (#51)
Re: asynchronous and vectorized execution

Thank you for the comment.

At Mon, 1 Aug 2016 10:44:56 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in <CAJ3gD9ek4Y4SGTSuc_pzkGYwLMbrc9QOM7m1D8bj99JNW16o0g@mail.gmail.com>

On 21 July 2016 at 15:20, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

After some consideration, I found that ExecAsyncWaitForNode
cannot be reentrant, because that would mean control passes into
async-unaware nodes while not-ready nodes still exist, which is an
inconsistent state. To inhibit such reentry, I allocated node
identifiers in depth-first order so that the ancestor-descendant
relationship can be checked in a simple way (the nested-set model),
and call ExecAsyncConfigureWait only for the descendant nodes of the
parameter planstate.

We have estate->waiting_nodes containing a mix of async-aware and
non-async-aware nodes. I was thinking, an asynchrony tree would have only
async-aware nodes, with possibly multiple asynchrony sub-trees in a tree.
Somehow, if we restrict the bubbling up of events only up to the root of the
asynchrony subtree, do you think we can simplify some of the complexities?

The current code prohibits registration of nodes outside the
current subtree in order to avoid the reentry disaster.

Indeed, leaving a "waiting node" mark or the like on every root
node at the first visit would let the propagation stop at the root
of any async subtree. Nevertheless, when an async child under an
inactive async root fires, the new tuple is loaded but not
consumed, and a subsequent firing on the same child then leads to
a deadlock (without result queueing). However, that can be avoided
if ExecAsyncConfigureWait doesn't register nodes that are already
in the ready state.
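
As a sketch (not verbatim from the patch), the guard at the top of a
per-node ConfigureWait callback would look like this:

    /*
     * A node that still holds an unconsumed result must not register a
     * wait event again; a second wakeup on it could lead to the
     * deadlock described above.
     */
    if (planstate->result_ready)
        return false;           /* nothing registered */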

On the other hand, two or more asynchronous nodes can share a
synchronization object. For instance, multiple postgres_fdw scan
nodes can share one server connection, and only one of them can be
in a waitable state at a time. If no async child in the current
async subtree is waitable, the query must be stuck. So I think it
is crucial for ExecAsyncWaitForNode to force at least one child *in
the current async subtree* into the waiting state in such a
situation. The ancestor-descendant relationship is necessary to
do that anyway.
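
Concretely, the fallback in ExecAsyncWaitForNode currently looks
roughly like this (condensed from the patch):

    if (!has_event)
    {
        /*
         * No descendant managed to register an event, e.g. because all
         * of them share one connection that is owned elsewhere; force
         * one descendant in the current async subtree to acquire it
         * and register.
         */
        for (n = 0; n < estate->num_waiting_nodes; ++n)
        {
            PlanState *node = estate->waiting_nodes[n];

            if (!IsParent(planstate, node))
                continue;       /* outside the current async subtree */
            if (ExecAsyncConfigureWait(node, ASYNCCONF_FORCE_ADD))
                break;          /* one waitable child is enough */
        }
        if (n == estate->num_waiting_nodes)
            ereport(ERROR,
                    (errmsg("inconsistency in asynchronous execution")));
    }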

Since we need the node id to detect the ancestor-descendant
relationship anyway, and ultimately have to restrict async nodes
with it, activating only descendant nodes from the start makes
things simpler than avoiding the possible deadlock later as
described above.

# I tried implementing it as a per-subtree waiting-node list, but
# it was fragile and too ugly..

For example, ExecAsyncWaitForNode() has become a bit complex, seemingly
because it has to handle non-async nodes also, and that's the reason I
believe you have introduced modes such as ASYNCCONF_FORCE_ADD.

As explained above, ASYNCCONF_FORCE_ADD is not for
non-async nodes, but for sets of async nodes that share a
synchronization object. We could let ExecAsyncConfigureWait
forcibly acquire the sync object from the start, but that in turn
causes possibly unnecessary transfers of the sync object among the
nodes sharing it.

I hope the above sentences are readable enough, but any questions
are welcome, even about the meaning of a single sentence.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#53Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#52)
7 attachment(s)
Re: asynchronous and vectorized execution

Hello,

I considered applying the async infrastructure to nodeGather,
but since parallel workers hardly ever make Gather (or the leader)
wait, it's really useless, at least for simple cases. Furthermore,
as several people may have said before, unlike foreign scans,
gather (or other kinds of parallel) nodes usually have several
workers, and at most a two-digit number of them even on so-called
many-core boxes. I finally gave up applying this to nodeGather.

As a result, the attached patchset is functionally the same as
the last version, but replaces the misused Assert with
AssertMacro.
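
For those unfamiliar with the distinction: Assert() expands to a
statement, so it cannot appear inside an expression-style macro, while
AssertMacro() expands to an expression and therefore composes with the
comma operator. Roughly, the fixed macro:

    #define ExecConsumeResult(node) \
        ( \
            AssertMacro((node)->result_ready), \
            (node)->result_ready = false, \
            (TupleTableSlot *) (node)->result \
        )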

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0003-Lightweight-framework-for-waiting-for-events.patch (text/x-patch; charset=us-ascii)
From a66593db87c6b228f9906be2ef72c38df942350d Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Mon, 9 May 2016 11:48:11 -0400
Subject: [PATCH 3/7] Lightweight framework for waiting for events.

---
 src/backend/executor/Makefile       |   4 +-
 src/backend/executor/execAsync.c    | 256 ++++++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c |  82 ++++++++----
 src/include/executor/execAsync.h    |  23 ++++
 src/include/executor/executor.h     |   2 +
 src/include/nodes/execnodes.h       |  10 ++
 6 files changed, 352 insertions(+), 25 deletions(-)
 create mode 100644 src/backend/executor/execAsync.c
 create mode 100644 src/include/executor/execAsync.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 51edd4c..0675b01 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -12,8 +12,8 @@ subdir = src/backend/executor
 top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = execAmi.o execCurrent.o execGrouping.o execIndexing.o execJunk.o \
-       execMain.o execParallel.o execProcnode.o execQual.o \
+OBJS = execAmi.o execAsync.o execCurrent.o execGrouping.o execIndexing.o \
+       execJunk.o execMain.o execParallel.o execProcnode.o execQual.o \
        execScan.o execTuples.o \
        execUtils.o functions.o instrument.o nodeAppend.o nodeAgg.o \
        nodeBitmapAnd.o nodeBitmapOr.o \
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
new file mode 100644
index 0000000..20601fa
--- /dev/null
+++ b/src/backend/executor/execAsync.c
@@ -0,0 +1,256 @@
+/*-------------------------------------------------------------------------
+ *
+ * execAsync.c
+ *	  Support routines for asynchronous execution.
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * This file contains routines that are intended to support asynchronous
+ * execution; that is, suspending an executor node until some external
+ * event occurs, or until one of its child nodes produces a tuple.
+ * This allows the executor to avoid blocking on a single external event,
+ * such as a file descriptor waiting on I/O, or a parallel worker which
+ * must complete work elsewhere in the plan tree, when there might at the
+ * same time be useful computation that could be accomplished in some
+ * other part of the plan tree.
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execAsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execAsync.h"
+#include "executor/executor.h"
+#include "storage/latch.h"
+
+#define	EVENT_BUFFER_SIZE		16
+
+static void ExecAsyncConfigureWait(PlanState *planstate, bool reinit);
+
+void
+ExecAsyncWaitForNode(PlanState *planstate)
+{
+	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
+	PlanState  *callbacks[EVENT_BUFFER_SIZE];
+	int			ncallbacks = 0;
+	EState *estate = planstate->state;
+
+	while (!planstate->result_ready)
+	{
+		bool	reinit = (estate->es_wait_event_set == NULL);
+		int		n;
+		int		noccurred;
+
+		if (reinit)
+		{
+			/*
+			 * Allow for a few extra events without reinitializing.  It
+			 * doesn't seem worth the complexity of doing anything very
+			 * aggressive here, because plans that depend on massive numbers
+			 * of external FDs are likely to run afoul of kernel limits anyway.
+			 */
+			estate->es_max_async_events = estate->es_total_async_events + 16;
+			estate->es_wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt,
+								   estate->es_max_async_events);
+		}
+
+		/* Give each waiting node a chance to add or modify events. */
+		for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+			ExecAsyncConfigureWait(estate->es_waiting_nodes[n], reinit);
+
+		/* Wait for at least one event to occur. */
+		noccurred = WaitEventSetWait(estate->es_wait_event_set, -1,
+									 occurred_event, EVENT_BUFFER_SIZE);
+		Assert(noccurred > 0);
+
+		/*
+		 * Loop over the occurred events and make a list of nodes that need
+		 * a callback.  The waiting nodes should have registered their wait
+		 * events with user_data pointing back to the node.
+		 */
+		for (n = 0; n < noccurred; ++n)
+		{
+			WaitEvent  *w = &occurred_event[n];
+			PlanState  *ps = w->user_data;
+
+			callbacks[ncallbacks++] = ps;
+		}
+
+		/*
+		 * Initially, this loop will call the node-type-specific function for
+		 * each node for which an event occurred.  If any of those nodes
+		 * produce a result, its parent enters the set of nodes that are
+		 * pending for a callback.  In this way, when a result becomes
+		 * available in a leaf of the plan tree, it can bubble upwards towards
+		 * the root as far as necessary.
+		 */
+		while (ncallbacks > 0)
+		{
+			int		i,
+					j;
+
+			/* Loop over all callbacks. */
+			for (i = 0; i < ncallbacks; ++i)
+			{
+				/* Skip if NULL. */
+				if (callbacks[i] == NULL)
+					continue;
+
+				/*
+				 * Remove any duplicates.  O(n) may not seem good, but it
+				 * should hopefully be OK as long as EVENT_BUFFER_SIZE is
+				 * not too large.
+				 */
+				for (j = i + 1; j < ncallbacks; ++j)
+					if (callbacks[i] == callbacks[j])
+						callbacks[j] = NULL;
+
+				/* Dispatch to node-type-specific code. */
+				ExecDispatchNode(callbacks[i]);
+
+				/*
+				 * If there's now a tuple ready, we must dispatch to the
+				 * parent node; otherwise, there's nothing more to do.
+				 */
+				if (callbacks[i]->result_ready)
+					callbacks[i] = callbacks[i]->parent;
+				else
+					callbacks[i] = NULL;
+			}
+
+			/* Squeeze out NULLs. */
+			for (i = 0, j = 0; j < ncallbacks; ++j)
+				if (callbacks[j] != NULL)
+					callbacks[i++] = callbacks[j];
+			ncallbacks = i;
+		}
+	}
+}
+
+/*
+ * An executor node should call this function to signal that it needs to wait
+ * on one or more events that can be registered on a WaitEventSet.  nevents
+ * should be the maximum number of events that it will wish to register.
+ * reinit should be true if the node can't reuse the WaitEventSet it most
+ * recently initialized, for example because it needs to drop a wait event
+ * from the set.
+ */
+void
+ExecAsyncNeedsWait(PlanState *planstate, int nevents, bool reinit)
+{
+	EState *estate = planstate->state;
+
+	Assert(nevents > 0); 	/* otherwise, use ExecAsyncDoesNotNeedWait */
+
+	/*
+	 * If this node is not already present in the array of waiting nodes,
+	 * then add it.  If that array hasn't been allocated or is full, this may
+	 * require (re)allocating it.
+	 */
+	if (planstate->n_async_events == 0)
+	{
+		if (estate->es_max_waiting_nodes >= estate->es_num_waiting_nodes)
+		{
+			int		newmax;
+
+			if (estate->es_max_waiting_nodes == 0)
+			{
+				newmax = 16;
+				estate->es_waiting_nodes =
+					MemoryContextAlloc(estate->es_query_cxt, newmax);
+			}
+			else
+			{
+				newmax = estate->es_max_waiting_nodes * 2;
+				estate->es_waiting_nodes =
+					repalloc(estate->es_waiting_nodes,
+							 newmax * sizeof(PlanState *));
+			}
+			estate->es_max_waiting_nodes = newmax;
+		}
+		estate->es_waiting_nodes[estate->es_num_waiting_nodes++] = planstate;
+	}
+
+	/* Adjust per-node and per-estate totals. */
+	estate->es_total_async_events -= planstate->n_async_events;
+	planstate->n_async_events = nevents;
+	estate->es_total_async_events += planstate->n_async_events;
+
+	/*
+	 * If a WaitEventSet has already been created, we need to discard it and
+	 * start again if the user passed reinit = true, or if the total number of
+	 * required events exceeds the supported number.
+	 */
+	if (estate->es_wait_event_set != NULL && (reinit ||
+		estate->es_total_async_events > estate->es_max_async_events))
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * If an executor node no longer needs to wait, it should call this function
+ * to report that fact.
+ */
+void
+ExecAsyncDoesNotNeedWait(PlanState *planstate)
+{
+	int		n;
+	EState *estate = planstate->state;
+
+	if (planstate->n_async_events <= 0)
+		return;
+
+	/*
+	 * Remove the node from the list of waiting nodes.  (Is a linear search
+	 * going to be a problem here?  I think probably not.)
+	 */
+	for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+	{
+		if (estate->es_waiting_nodes[n] == planstate)
+		{
+			estate->es_waiting_nodes[n] =
+				estate->es_waiting_nodes[--estate->es_num_waiting_nodes];
+			break;
+		}
+	}
+
+	/* We should always find ourselves in the array. */
+	Assert(n < estate->es_num_waiting_nodes);
+
+	/* We no longer need any asynchronous events. */
+	estate->es_total_async_events -= planstate->n_async_events;
+	planstate->n_async_events = 0;
+
+	/*
+	 * The next wait will need to rebuild the WaitEventSet, because whatever
+	 * events we registered are gone now.  It's probably OK that this code
+	 * assumes we actually did register some events at one point, because we
+	 * needed to wait at some point and we don't any more.
+	 */
+	if (estate->es_wait_event_set != NULL)
+	{
+		FreeWaitEventSet(estate->es_wait_event_set);
+		estate->es_wait_event_set = NULL;
+	}
+}
+
+/*
+ * Give per-nodetype function a chance to register wait events.
+ */
+static void
+ExecAsyncConfigureWait(PlanState *planstate, bool reinit)
+{
+	switch (nodeTag(planstate))
+	{
+		/* XXX: Add calls to per-nodetype handlers here. */
+		default:
+			elog(ERROR, "unexpected node type: %d", nodeTag(planstate));
+	}
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 3f2ebff..b7ac08e 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -77,6 +77,7 @@
  */
 #include "postgres.h"
 
+#include "executor/execAsync.h"
 #include "executor/executor.h"
 #include "executor/nodeAgg.h"
 #include "executor/nodeAppend.h"
@@ -368,24 +369,14 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 
 
 /* ----------------------------------------------------------------
- *		ExecProcNode
+ *		ExecDispatchNode
  *
- *		Execute the given node to return a(nother) tuple.
+ *		Invoke the given node's dispatch function.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
-ExecProcNode(PlanState *node)
+void
+ExecDispatchNode(PlanState *node)
 {
-	TupleTableSlot *result;
-
-	CHECK_FOR_INTERRUPTS();
-
-	/* mark any previous result as having been consumed */
-	node->result_ready = false;
-
-	if (node->chgParam != NULL) /* something changed */
-		ExecReScan(node);		/* let ReScan handle this */
-
 	if (node->instrument)
 		InstrStartNode(node->instrument);
 
@@ -539,22 +530,67 @@ ExecProcNode(PlanState *node)
 
 		default:
 			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			result = NULL;
 			break;
 	}
 
-	/* We don't support asynchronous execution yet. */
-	Assert(node->result_ready);
+	if (node->instrument)
+	{
+		double	nTuples = 0.0;
 
-	/* Result should be a TupleTableSlot, unless it's NULL. */
-	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+		if (node->result_ready && node->result != NULL &&
+			IsA(node->result, TupleTableSlot))
+			nTuples = 1.0;
 
-	result = (TupleTableSlot *) node->result;
+		InstrStopNode(node->instrument, nTuples);
+	}
+}
 
-	if (node->instrument)
-		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
-	return result;
+/* ----------------------------------------------------------------
+ *		ExecExecuteNode
+ *
+ *		Request the next tuple from the given node.  Note that
+ *		if the node supports asynchrony, result_ready may not be
+ *		set on return (use ExecProcNode if you need that, or call
+ *		ExecAsyncWaitForNode).
+ * ----------------------------------------------------------------
+ */
+void
+ExecExecuteNode(PlanState *node)
+{
+	node->result_ready = false;
+	ExecDispatchNode(node);
+}
+
+
+/* ----------------------------------------------------------------
+ *		ExecProcNode
+ *
+ *		Get the next tuple from the given node.  If the node is
+ *		asynchronous, wait for a tuple to be ready before
+ *		returning.
+ * ----------------------------------------------------------------
+ */
+TupleTableSlot *
+ExecProcNode(PlanState *node)
+{
+	CHECK_FOR_INTERRUPTS();
+
+	/* mark any previous result as having been consumed */
+	node->result_ready = false;
+
+	if (node->chgParam != NULL) /* something changed */
+		ExecReScan(node);		/* let ReScan handle this */
+
+	ExecDispatchNode(node);
+
+	if (!node->result_ready)
+		ExecAsyncWaitForNode(node);
+
+	/* Result should be a TupleTableSlot, unless it's NULL. */
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+
+	return (TupleTableSlot *) node->result;
 }
 
 
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
new file mode 100644
index 0000000..38b37a1
--- /dev/null
+++ b/src/include/executor/execAsync.h
@@ -0,0 +1,23 @@
+/*--------------------------------------------------------------------
+ * execAsync.h
+ *		Support functions for asynchronous query execution
+ *
+ * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *		src/include/executor/execAsync.h
+ *--------------------------------------------------------------------
+ */
+
+#ifndef EXECASYNC_H
+#define EXECASYNC_H
+
+#include "nodes/execnodes.h"
+
+extern void ExecAsyncWaitForNode(PlanState *planstate);
+extern void ExecAsyncNeedsWait(PlanState *planstate, int nevents,
+	bool reinit);
+extern void ExecAsyncDoesNotNeedWait(PlanState *planstate);
+
+#endif   /* EXECASYNC_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 1eb09d8..7abc361 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -223,6 +223,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
  */
 extern PlanState *ExecInitNode(Plan *node, EState *estate, PlanState *parent,
 			 int eflags);
+extern void ExecDispatchNode(PlanState *node);
+extern void ExecExecuteNode(PlanState *node);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index ff6c453..76e36a2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -382,6 +382,14 @@ typedef struct EState
 	ParamListInfo es_param_list_info;	/* values of external params */
 	ParamExecData *es_param_exec_vals;	/* values of internal params */
 
+	/* Asynchronous execution support */
+	struct PlanState **es_waiting_nodes;		/* array of waiting nodes */
+	int			es_num_waiting_nodes;	/* # of waiters in array */
+	int			es_max_waiting_nodes;	/* # of allocated entries */
+	int			es_total_async_events;	/* total of per-node n_async_events */
+	int			es_max_async_events;	/* # supported by event set */
+	struct WaitEventSet *es_wait_event_set;
+
 	/* Other working state: */
 	MemoryContext es_query_cxt; /* per-query context in which EState lives */
 
@@ -1034,6 +1042,8 @@ typedef struct PlanState
 	bool		result_ready;	/* true if result is ready */
 	Node	   *result;			/* result, most often TupleTableSlot */
 
+	int			n_async_events;	/* # of async events we want to register */
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
-- 
2.9.2

0004-Fix-async-execution-framework.patch (text/x-patch; charset=us-ascii)
From 2fd05ae24fb587b951b5fdc81085db4264fde58a Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 17:23:16 +0900
Subject: [PATCH 4/7] Fix async execution framework.

This commit changes some behavior of the framework and fixes some
minor bugs.
---
 src/backend/executor/execAsync.c    | 141 +++++++++++++++++++++++++-----------
 src/backend/executor/execProcnode.c |  33 ++++++---
 src/backend/executor/execScan.c     |  33 +++++++--
 src/backend/executor/execUtils.c    |   8 ++
 src/backend/executor/nodeSeqscan.c  |   7 +-
 src/include/executor/execAsync.h    |   7 ++
 src/include/executor/executor.h     |  20 +++--
 src/include/nodes/execnodes.h       |  26 +++++--
 8 files changed, 199 insertions(+), 76 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 20601fa..6da7ef2 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -29,7 +29,7 @@
 
 #define	EVENT_BUFFER_SIZE		16
 
-static void ExecAsyncConfigureWait(PlanState *planstate, bool reinit);
+static bool ExecAsyncConfigureWait(PlanState *planstate, AsyncConfigMode mode);
 
 void
 ExecAsyncWaitForNode(PlanState *planstate)
@@ -37,13 +37,14 @@ ExecAsyncWaitForNode(PlanState *planstate)
 	WaitEvent	occurred_event[EVENT_BUFFER_SIZE];
 	PlanState  *callbacks[EVENT_BUFFER_SIZE];
 	int			ncallbacks = 0;
-	EState *estate = planstate->state;
+	EState     *estate = planstate->state;
 
 	while (!planstate->result_ready)
 	{
-		bool	reinit = (estate->es_wait_event_set == NULL);
+		bool	reinit = (estate->wait_event_set == NULL);
 		int		n;
 		int		noccurred;
+		bool	has_event = false;
 
 		if (reinit)
 		{
@@ -53,18 +54,68 @@ ExecAsyncWaitForNode(PlanState *planstate)
 			 * aggressive here, because plans that depend on massive numbers
 			 * of external FDs are likely to run afoul of kernel limits anyway.
 			 */
-			estate->es_max_async_events = estate->es_total_async_events + 16;
-			estate->es_wait_event_set =
-				CreateWaitEventSet(estate->es_query_cxt,
-								   estate->es_max_async_events);
+			estate->max_events = estate->total_events + 16;
+			estate->wait_event_set =
+				CreateWaitEventSet(estate->es_query_cxt, estate->max_events);
 		}
 
-		/* Give each waiting node a chance to add or modify events. */
-		for (n = 0; n < estate->es_num_waiting_nodes; ++n)
-			ExecAsyncConfigureWait(estate->es_waiting_nodes[n], reinit);
+		/*
+		 * Give each waiting node a chance to add or modify events to the
+		 * descendants of this planstate.
+		 */
+		for (n = 0; n < estate->num_waiting_nodes; ++n)
+		{
+			PlanState *node = estate->waiting_nodes[n];
+
+			/*
+			 * We assume that few nodes are async-aware, and that async-unaware
+			 * nodes must not be reverse-dispatched from lower nodes that are
+			 * async-aware. Firing an async node that is not a descendant
+			 * of the planstate would cause such reverse-dispatching into
+			 * async-unaware nodes, which is unexpected behavior for them.
+			 *
+			 * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
+			 * where the OUTER is running asynchronously but the Hashjoin is
+			 * waiting on the async INNER during inner-hash creation. If the
+			 * OUTER fires in that case, ExecAsyncWaitForNode would finally
+			 * dispatch to the Hashjoin, which is still in the middle of
+			 * doing its work.
+			 */
+			if (!IsParent(planstate, node))
+				continue;
+
+			has_event |= 
+				ExecAsyncConfigureWait(node,
+					   reinit ? ASYNCCONF_TRY_ADD : ASYNCCONF_MODIFY);
+		}
+
+		if (!has_event)
+		{
+			/*
+			 * No event to wait for. This occurs when all of the waiters
+			 * share their synchronization objects with nodes in other
+			 * sync-subtrees. In any case we must have at least one event.
+			 */
+
+			 for (n = 0; n < estate->num_waiting_nodes; ++n)
+			 {
+				 PlanState *node = estate->waiting_nodes[n];
 
-		/* Wait for at least one event to occur. */
-		noccurred = WaitEventSetWait(estate->es_wait_event_set, -1,
+				 /* Skip if this node is not a descendant of planstate */
+				 if (!IsParent(planstate, node))
+					 continue;
+
+				 if (ExecAsyncConfigureWait(node, ASYNCCONF_FORCE_ADD))
+					 break;
+			 }
+
+			 /* Too bad. We have nothing to wait for. Something is wrong. */
+			 if (n == estate->num_waiting_nodes)
+				 ereport(ERROR,
+						 (errmsg("inconsistency in asynchronous execution")));
+		}
+
+		noccurred = WaitEventSetWait(estate->wait_event_set, -1,
 									 occurred_event, EVENT_BUFFER_SIZE);
 		Assert(noccurred > 0);
 
@@ -115,9 +166,10 @@ ExecAsyncWaitForNode(PlanState *planstate)
 
 				/*
 				 * If there's now a tuple ready, we must dispatch to the
-				 * parent node; otherwise, there's nothing more to do.
+				 * parent node up to the waiting root; otherwise, there's
+				 * nothing more to do.
 				 */
-				if (callbacks[i]->result_ready)
+				if (callbacks[i]->result_ready && callbacks[i] != planstate)
 					callbacks[i] = callbacks[i]->parent;
 				else
 					callbacks[i] = NULL;
@@ -143,7 +195,7 @@ ExecAsyncWaitForNode(PlanState *planstate)
 void
 ExecAsyncNeedsWait(PlanState *planstate, int nevents, bool reinit)
 {
-	EState *estate = planstate->state;
+	EState     *estate = planstate->state;
 
 	Assert(nevents > 0); 	/* otherwise, use ExecAsyncDoesNotNeedWait */
 
@@ -154,43 +206,45 @@ ExecAsyncNeedsWait(PlanState *planstate, int nevents, bool reinit)
 	 */
 	if (planstate->n_async_events == 0)
 	{
-		if (estate->es_max_waiting_nodes >= estate->es_num_waiting_nodes)
+		if (estate->max_waiting_nodes <= estate->num_waiting_nodes)
 		{
 			int		newmax;
 
-			if (estate->es_max_waiting_nodes == 0)
+			if (estate->max_waiting_nodes == 0)
 			{
 				newmax = 16;
-				estate->es_waiting_nodes =
-					MemoryContextAlloc(estate->es_query_cxt, newmax);
+				estate->waiting_nodes =
+					MemoryContextAlloc(estate->es_query_cxt,
+									   newmax * sizeof(PlanState *));
 			}
 			else
 			{
-				newmax = estate->es_max_waiting_nodes * 2;
-				estate->es_waiting_nodes =
-					repalloc(estate->es_waiting_nodes,
+				newmax = estate->max_waiting_nodes * 2;
+				estate->waiting_nodes =
+					repalloc(estate->waiting_nodes,
 							 newmax * sizeof(PlanState *));
 			}
-			estate->es_max_waiting_nodes = newmax;
+			estate->max_waiting_nodes = newmax;
 		}
-		estate->es_waiting_nodes[estate->es_num_waiting_nodes++] = planstate;
+		estate->waiting_nodes[estate->num_waiting_nodes++] =
+			planstate;
 	}
 
-	/* Adjust per-node and per-estate totals. */
-	estate->es_total_async_events -= planstate->n_async_events;
+	/* Adjust per-node and per-estate totals. */
+	estate->total_events -= planstate->n_async_events;
 	planstate->n_async_events = nevents;
-	estate->es_total_async_events += planstate->n_async_events;
+	estate->total_events += planstate->n_async_events;
 
 	/*
 	 * If a WaitEventSet has already been created, we need to discard it and
 	 * start again if the user passed reinit = true, or if the total number of
 	 * required events exceeds the supported number.
 	 */
-	if (estate->es_wait_event_set != NULL && (reinit ||
-		estate->es_total_async_events > estate->es_max_async_events))
+	if (estate->wait_event_set != NULL && (reinit ||
+		estate->total_events > estate->max_events))
 	{
-		FreeWaitEventSet(estate->es_wait_event_set);
-		estate->es_wait_event_set = NULL;
+		FreeWaitEventSet(estate->wait_event_set);
+		estate->wait_event_set = NULL;
 	}
 }
 
@@ -211,21 +265,20 @@ ExecAsyncDoesNotNeedWait(PlanState *planstate)
 	 * Remove the node from the list of waiting nodes.  (Is a linear search
 	 * going to be a problem here?  I think probably not.)
 	 */
-	for (n = 0; n < estate->es_num_waiting_nodes; ++n)
+	for (n = 0; n < estate->num_waiting_nodes; ++n)
 	{
-		if (estate->es_waiting_nodes[n] == planstate)
-		{
-			estate->es_waiting_nodes[n] =
-				estate->es_waiting_nodes[--estate->es_num_waiting_nodes];
+		if (estate->waiting_nodes[n] == planstate)
 			break;
-		}
 	}
 
 	/* We should always find ourselves in the array. */
-	Assert(n < estate->es_num_waiting_nodes);
+	Assert(n < estate->num_waiting_nodes);
+
+	estate->waiting_nodes[n] =
+		estate->waiting_nodes[--estate->num_waiting_nodes];
 
 	/* We no longer need any asynchronous events. */
-	estate->es_total_async_events -= planstate->n_async_events;
+	estate->total_events -= planstate->n_async_events;
 	planstate->n_async_events = 0;
 
 	/*
@@ -234,18 +287,18 @@ ExecAsyncDoesNotNeedWait(PlanState *planstate)
 	 * assumes we actually did register some events at one point, because we
 	 * needed to wait at some point and we don't any more.
 	 */
-	if (estate->es_wait_event_set != NULL)
+	if (estate->wait_event_set != NULL)
 	{
-		FreeWaitEventSet(estate->es_wait_event_set);
-		estate->es_wait_event_set = NULL;
+		FreeWaitEventSet(estate->wait_event_set);
+		estate->wait_event_set = NULL;
 	}
 }
 
 /*
  * Give per-nodetype function a chance to register wait events.
  */
-static void
-ExecAsyncConfigureWait(PlanState *planstate, bool reinit)
+static bool
+ExecAsyncConfigureWait(PlanState *planstate, AsyncConfigMode config_mode)
 {
 	switch (nodeTag(planstate))
 	{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index b7ac08e..3590ab1 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -139,6 +139,7 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 	PlanState  *result;
 	List	   *subps;
 	ListCell   *l;
+	int			this_node_id = estate->next_node_id++;
 
 	/*
 	 * do nothing when we get to the end of a leaf on tree.
@@ -344,6 +345,10 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 	/* Set parent pointer. */
 	result->parent = parent;
 
+	/* Set this node id and that of the right sibling */
+	result->node_id = this_node_id;
+	result->right_node_id = estate->next_node_id;
+
 	/*
 	 * Initialize any initPlans present in this node.  The planner put them in
 	 * a separate list for us.
@@ -374,9 +379,13 @@ ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
  *		Invoke the given node's dispatch function.
  * ----------------------------------------------------------------
  */
-void
+
+inline void
 ExecDispatchNode(PlanState *node)
 {
+	if (node->result_ready)
+		return;
+
 	if (node->instrument)
 		InstrStartNode(node->instrument);
 
@@ -559,6 +568,8 @@ void
 ExecExecuteNode(PlanState *node)
 {
 	node->result_ready = false;
+	if (node->chgParam != NULL) /* something changed */
+		ExecReScan(node);		/* let ReScan handle this */
 	ExecDispatchNode(node);
 }
 
@@ -569,15 +580,18 @@ ExecExecuteNode(PlanState *node)
  *		Get the next tuple from the given node.  If the node is
  *		asynchronous, wait for a tuple to be ready before
  *		returning.
- * ----------------------------------------------------------------
+ *		The given node works as the termination node of an asynchronous
+ *		execution subtree, and every subtree should have an individual context.
+ * ----------------------------------------------------------------
  */
 TupleTableSlot *
 ExecProcNode(PlanState *node)
 {
 	CHECK_FOR_INTERRUPTS();
 
-	/* mark any previous result as having been consumed */
-	node->result_ready = false;
+	/* Return unconsumed result if any */
+	if (node->result_ready)
+		return ExecConsumeResult(node);
 
 	if (node->chgParam != NULL) /* something changed */
 		ExecReScan(node);		/* let ReScan handle this */
@@ -587,10 +601,7 @@ ExecProcNode(PlanState *node)
 	if (!node->result_ready)
 		ExecAsyncWaitForNode(node);
 
-	/* Result should be a TupleTableSlot, unless it's NULL. */
-	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
-
-	return (TupleTableSlot *) node->result;
+	return ExecConsumeResult(node);
 }
 
 
@@ -848,6 +859,8 @@ ExecEndNode(PlanState *node)
 bool
 ExecShutdownNode(PlanState *node)
 {
+	bool ret;
+
 	if (node == NULL)
 		return false;
 
@@ -860,5 +873,7 @@ ExecShutdownNode(PlanState *node)
 			break;
 	}
 
-	return planstate_tree_walker(node, ExecShutdownNode, NULL);
+	ret = planstate_tree_walker(node, ExecShutdownNode, NULL);
+
+	return ret;
 }
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 095d40b..69d616b 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -128,6 +128,9 @@ ExecScan(ScanState *node,
 	ExprDoneCond isDone;
 	TupleTableSlot *resultSlot;
 
+	if (node->ps.result_ready)
+		return;
+
 	/*
 	 * Fetch data from node
 	 */
@@ -136,14 +139,25 @@ ExecScan(ScanState *node,
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * The underlying nodes don't use ExecReturnTuple. Set this flag here so
+	 * that the async-unaware/incapable children don't need to touch it
+	 * explicitly. Async-aware/capable nodes will unset it instead if needed.
+	 */
+	node->ps.result_ready = true;
+
+	/*
 	 * If we have neither a qual to check nor a projection to do, just skip
 	 * all the overhead and produce the raw scan tuple.
 	 */
 	if (!qual && !projInfo)
 	{
+		TupleTableSlot *slot;
+
 		ResetExprContext(econtext);
-		ExecReturnTuple(&node->ps,
-						ExecScanFetch(node, accessMtd, recheckMtd));
+		slot = ExecScanFetch(node, accessMtd, recheckMtd);
+		if (node->ps.result_ready)
+			node->ps.result = (Node *) slot;
+
 		return;
 	}
 
@@ -158,7 +172,7 @@ ExecScan(ScanState *node,
 		resultSlot = ExecProject(projInfo, &isDone);
 		if (isDone == ExprMultipleResult)
 		{
-			ExecReturnTuple(&node->ps, resultSlot);
+			node->ps.result = (Node *) resultSlot;
 			return;
 		}
 		/* Done with that source tuple... */
@@ -184,6 +198,9 @@ ExecScan(ScanState *node,
 
 		slot = ExecScanFetch(node, accessMtd, recheckMtd);
 
+		if (!node->ps.result_ready)
+			return;
+
 		/*
 		 * if the slot returned by the accessMtd contains NULL, then it means
 		 * there is nothing more to scan so we just return an empty slot,
@@ -193,9 +210,9 @@ ExecScan(ScanState *node,
 		if (TupIsNull(slot))
 		{
 			if (projInfo)
-				ExecReturnTuple(&node->ps, ExecClearTuple(projInfo->pi_slot));
-			else
-				ExecReturnTuple(&node->ps, slot);
+				slot = ExecClearTuple(projInfo->pi_slot);
+
+			node->ps.result = (Node *) slot;
 			return;
 		}
 
@@ -227,7 +244,7 @@ ExecScan(ScanState *node,
 				if (isDone != ExprEndResult)
 				{
 					node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-					ExecReturnTuple(&node->ps, resultSlot);
+					node->ps.result = (Node *) resultSlot;
 					return;
 				}
 			}
@@ -236,7 +253,7 @@ ExecScan(ScanState *node,
 				/*
 				 * Here, we aren't projecting, so just return scan tuple.
 				 */
-				ExecReturnTuple(&node->ps, slot);
+				node->ps.result = (Node *) slot;
 				return;
 			}
 		}
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index a3bcb10..8318411 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -115,6 +115,14 @@ CreateExecutorState(void)
 	estate->es_param_list_info = NULL;
 	estate->es_param_exec_vals = NULL;
 
+	estate->waiting_nodes = NULL;
+	estate->num_waiting_nodes = 0;
+	estate->max_waiting_nodes = 0;
+	estate->total_events = 0;
+	estate->max_events = 0;
+	estate->wait_event_set = NULL;
+	estate->next_node_id = 1;
+
 	estate->es_query_cxt = qcontext;
 
 	estate->es_tupleTable = NIL;
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 0ca86d9..ef1ce9c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -124,9 +124,10 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
 void
 ExecSeqScan(SeqScanState *node)
 {
-	return ExecScan((ScanState *) node,
-					(ExecScanAccessMtd) SeqNext,
-					(ExecScanRecheckMtd) SeqRecheck);
+	ExecScan((ScanState *) node,
+			 (ExecScanAccessMtd) SeqNext,
+			 (ExecScanRecheckMtd) SeqRecheck);
+
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/include/executor/execAsync.h b/src/include/executor/execAsync.h
index 38b37a1..f1c748b 100644
--- a/src/include/executor/execAsync.h
+++ b/src/include/executor/execAsync.h
@@ -15,6 +15,13 @@
 
 #include "nodes/execnodes.h"
 
+typedef enum AsyncConfigMode
+{
+	ASYNCCONF_MODIFY,
+	ASYNCCONF_TRY_ADD,
+	ASYNCCONF_FORCE_ADD
+} AsyncConfigMode;
+
 extern void ExecAsyncWaitForNode(PlanState *planstate);
 extern void ExecAsyncNeedsWait(PlanState *planstate, int nevents,
 	bool reinit);
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7abc361..97925d5 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -231,14 +231,22 @@ extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
 /* Convenience function to set a node's result to a TupleTableSlot. */
-static inline void
-ExecReturnTuple(PlanState *node, TupleTableSlot *slot)
-{
-	Assert(!node->result_ready);
-	node->result = (Node *) slot;
-	node->result_ready = true;
+#define ExecReturnTuple(node, slot) \
+{ \
+	AssertMacro(!(node)->result_ready);	\
+	(node)->result = (Node *) (slot);	\
+	(node)->result_ready = true; \
 }
 
+/* Convenience function to retrieve a node's result. */
+#define ExecConsumeResult(node) \
+( \
+	AssertMacro((node)->result_ready), \
+	AssertMacro((node)->result == NULL || IsA((node)->result, TupleTableSlot)), \
+	(node)->result_ready = false, \
+	(TupleTableSlot *) (node)->result)
+
+
 /*
  * prototypes from functions in execQual.c
  */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 76e36a2..b72decc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -383,12 +383,14 @@ typedef struct EState
 	ParamExecData *es_param_exec_vals;	/* values of internal params */
 
 	/* Asynchronous execution support */
-	struct PlanState **es_waiting_nodes;		/* array of waiting nodes */
-	int			es_num_waiting_nodes;	/* # of waiters in array */
-	int			es_max_waiting_nodes;	/* # of allocated entries */
-	int			es_total_async_events;	/* total of per-node n_async_events */
-	int			es_max_async_events;	/* # supported by event set */
-	struct WaitEventSet *es_wait_event_set;
+	struct PlanState **waiting_nodes;	/* array of waiting nodes */
+	int			num_waiting_nodes;		/* # of waiters in array */
+	int			max_waiting_nodes;		/* # of allocated entries */
+	int			total_events;			/* total of per-node n_async_events */
+	int			max_events;				/* # supported by event set */
+	struct WaitEventSet *wait_event_set;
+
+	int			next_node_id;			/* node id for the next plan state */
 
 	/* Other working state: */
 	MemoryContext es_query_cxt; /* per-query context in which EState lives */
@@ -1038,6 +1040,15 @@ typedef struct PlanState
 								 * nodes point to one EState for the whole
 								 * top-level plan */
 
+	/*
+	 * node_id and right_node_id represent the ancestor-descendant
+	 * relationship via the nested-set model. The ids are assigned in
+	 * depth-first order, and those of all the descendants of a node lie
+	 * between that node's node_id and right_node_id - 1.
+	 */
+	int			node_id;		/* node id according to nested set model */
+	int			right_node_id;	/* node id of the right sibling */
+
 	struct PlanState *parent;	/* node which will receive tuples from us */
 	bool		result_ready;	/* true if result is ready */
 	Node	   *result;			/* result, most often TupleTableSlot */
@@ -1075,6 +1086,9 @@ typedef struct PlanState
 								 * functions in targetlist */
 } PlanState;
 
+/* Macros applied on PlanStates */
+#define IsParent(p, d) ((p)->node_id <= (d)->node_id && (d)->node_id < (p)->right_node_id)
+
 /* ----------------
  *	these are defined to avoid confusion problems with "left"
  *	and "right" and "inner" and "outer".  The convention is that
-- 
2.9.2

0005-Add-new-fdwroutine-AsyncConfigureWait-and-ShutdownFo.patch (text/x-patch; charset=us-ascii)
From 02d5392ea87c670588ceca38afcbfd301d883e2f Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 17:25:30 +0900
Subject: [PATCH 5/7] Add new fdwroutine AsyncConfigureWait and
 ShutdownForeignScan.

Async-capable nodes should handle AsyncConfigureWait and
ExecShutdownNode callbacks. This patch adds entries for FDWs in the
two functions and adds corresponding FdwRoutine entries.
---
 src/backend/executor/execAsync.c    | 14 ++++++++++++--
 src/backend/executor/execProcnode.c |  9 +++++++++
 src/include/foreign/fdwapi.h        |  8 ++++++++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index 6da7ef2..578d70f 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -25,6 +25,7 @@
 
 #include "executor/execAsync.h"
 #include "executor/executor.h"
+#include "foreign/fdwapi.h"
 #include "storage/latch.h"
 
 #define	EVENT_BUFFER_SIZE		16
@@ -302,8 +303,17 @@ ExecAsyncConfigureWait(PlanState *planstate, AsyncConfigMode config_mode)
 {
 	switch (nodeTag(planstate))
 	{
-		/* XXX: Add calls to per-nodetype handlers here. */
-		default:
+		/* Add calls to per-nodetype handlers here. */
+		case T_ForeignScanState:
+			{
+				ForeignScanState *node = (ForeignScanState *) planstate;
+				if (node->fdwroutine->AsyncConfigureWait)
+					return node->fdwroutine->AsyncConfigureWait(node, config_mode);
+			}
+			break;
+		default:
 			elog(ERROR, "unexpected node type: %d", nodeTag(planstate));
 	}
+
+	return false;
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 3590ab1..cef262b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -115,6 +115,7 @@
 #include "executor/nodeValuesscan.h"
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
+#include "foreign/fdwapi.h"
 #include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 
@@ -869,6 +870,14 @@ ExecShutdownNode(PlanState *node)
 		case T_GatherState:
 			ExecShutdownGather((GatherState *) node);
 			break;
+		case T_ForeignScanState:
+			{
+				ForeignScanState *fsstate = (ForeignScanState *) node;
+				FdwRoutine *fdwroutine = fsstate->fdwroutine;
+				if (fdwroutine->ShutdownForeignScan)
+					fdwroutine->ShutdownForeignScan(fsstate);
+			}
+			break;
 		default:
 			break;
 	}
diff --git a/src/include/foreign/fdwapi.h b/src/include/foreign/fdwapi.h
index e1b0d0d..8de44dd 100644
--- a/src/include/foreign/fdwapi.h
+++ b/src/include/foreign/fdwapi.h
@@ -13,6 +13,7 @@
 #define FDWAPI_H
 
 #include "access/parallel.h"
+#include "executor/execAsync.h"
 #include "nodes/execnodes.h"
 #include "nodes/relation.h"
 
@@ -154,6 +155,9 @@ typedef void (*InitializeWorkerForeignScan_function) (ForeignScanState *node,
 typedef bool (*IsForeignScanParallelSafe_function) (PlannerInfo *root,
 															 RelOptInfo *rel,
 														 RangeTblEntry *rte);
+typedef bool (*AsyncConfigureWait_function) (ForeignScanState *node,
+											 AsyncConfigMode config_mode);
+typedef void (*ShutdownForeignScan_function) (ForeignScanState *node);
 
 /*
  * FdwRoutine is the struct returned by a foreign-data wrapper's handler
@@ -224,6 +228,10 @@ typedef struct FdwRoutine
 	EstimateDSMForeignScan_function EstimateDSMForeignScan;
 	InitializeDSMForeignScan_function InitializeDSMForeignScan;
 	InitializeWorkerForeignScan_function InitializeWorkerForeignScan;
+
+	/* Support functions for asynchronous execution */
+	AsyncConfigureWait_function AsyncConfigureWait;
+	ShutdownForeignScan_function ShutdownForeignScan;
 } FdwRoutine;
 
 
-- 
2.9.2

0006-Make-postgres_fdw-async-capable.patch (text/x-patch; charset=us-ascii)
From 76863380dbbf589665dd44aeac828c7e2fa669c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 16:15:23 +0900
Subject: [PATCH 6/7] Make postgres_fdw async-capable

It sends the next FETCH just after the previous result is received and
returns !result_ready to the caller. This reduces the time spent
waiting for a result on every fetch command. Multiple nodes on the same
connection are properly arbitrated.
---
 contrib/postgres_fdw/connection.c              |  81 ++--
 contrib/postgres_fdw/expected/postgres_fdw.out |  34 +-
 contrib/postgres_fdw/postgres_fdw.c            | 511 +++++++++++++++++++++----
 contrib/postgres_fdw/postgres_fdw.h            |   4 +-
 contrib/postgres_fdw/sql/postgres_fdw.sql      |   4 +-
 5 files changed, 521 insertions(+), 113 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 8ca1c1c..0665d54 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -48,6 +48,7 @@ typedef struct ConnCacheEntry
 								 * one level of subxact open, etc */
 	bool		have_prep_stmt; /* have we prepared any stmts in this xact? */
 	bool		have_error;		/* have any subxacts aborted in this xact? */
+	void		*storage;		/* connection specific storage */
 } ConnCacheEntry;
 
 /*
@@ -63,6 +64,7 @@ static unsigned int prep_stmt_number = 0;
 static bool xact_got_connection = false;
 
 /* prototypes of private functions */
+static ConnCacheEntry *get_connection_entry(Oid umid);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
 static void check_conn_params(const char **keywords, const char **values);
 static void configure_remote_session(PGconn *conn);
@@ -74,31 +76,17 @@ static void pgfdw_subxact_callback(SubXactEvent event,
 					   SubTransactionId parentSubid,
 					   void *arg);
 
-
 /*
- * Get a PGconn which can be used to execute queries on the remote PostgreSQL
- * server with the user's authorization.  A new connection is established
- * if we don't already have a suitable one, and a transaction is opened at
- * the right subtransaction nesting depth if we didn't do that already.
- *
- * will_prep_stmt must be true if caller intends to create any prepared
- * statements.  Since those don't go away automatically at transaction end
- * (not even on error), we need this flag to cue manual cleanup.
- *
- * XXX Note that caching connections theoretically requires a mechanism to
- * detect change of FDW objects to invalidate already established connections.
- * We could manage that by watching for invalidation events on the relevant
- * syscaches.  For the moment, though, it's not clear that this would really
- * be useful and not mere pedantry.  We could not flush any active connections
- * mid-transaction anyway.
+ * Common function to acquire or create a connection cache entry.
  */
-PGconn *
-GetConnection(UserMapping *user, bool will_prep_stmt)
+
+static ConnCacheEntry *
+get_connection_entry(Oid umid)
 {
 	bool		found;
 	ConnCacheEntry *entry;
 	ConnCacheKey key;
-
+	
 	/* First time through, initialize connection cache hashtable */
 	if (ConnectionHash == NULL)
 	{
@@ -121,11 +109,8 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		RegisterSubXactCallback(pgfdw_subxact_callback, NULL);
 	}
 
-	/* Set flag that we did GetConnection during the current transaction */
-	xact_got_connection = true;
-
 	/* Create hash key for the entry.  Assume no pad bytes in key struct */
-	key = user->umid;
+	key = umid;
 
 	/*
 	 * Find or create cached entry for requested connection.
@@ -138,8 +123,39 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 		entry->xact_depth = 0;
 		entry->have_prep_stmt = false;
 		entry->have_error = false;
+		entry->storage = NULL;
 	}
 
+	return entry;
+}
+
+/*
+ * Get a PGconn which can be used to execute queries on the remote PostgreSQL
+ * server with the user's authorization.  A new connection is established
+ * if we don't already have a suitable one, and a transaction is opened at
+ * the right subtransaction nesting depth if we didn't do that already.
+ *
+ * will_prep_stmt must be true if caller intends to create any prepared
+ * statements.  Since those don't go away automatically at transaction end
+ * (not even on error), we need this flag to cue manual cleanup.
+ *
+ * XXX Note that caching connections theoretically requires a mechanism to
+ * detect change of FDW objects to invalidate already established connections.
+ * We could manage that by watching for invalidation events on the relevant
+ * syscaches.  For the moment, though, it's not clear that this would really
+ * be useful and not mere pedantry.  We could not flush any active connections
+ * mid-transaction anyway.
+ */
+PGconn *
+GetConnection(UserMapping *user, bool will_prep_stmt)
+{
+	ConnCacheEntry *entry;
+
+	/* Set flag that we did GetConnection during the current transaction */
+	xact_got_connection = true;
+
+	entry = get_connection_entry(user->umid);
+	
 	/*
 	 * We don't check the health of cached connection here, because it would
 	 * require some overhead.  Broken connection will be detected when the
@@ -176,6 +192,25 @@ GetConnection(UserMapping *user, bool will_prep_stmt)
 }
 
 /*
+ * Returns the connection-specific storage for this user. Allocates
+ * initsize bytes if it doesn't exist yet.
+ */
+void *
+GetConnectionSpecificStorage(UserMapping *user, size_t initsize)
+{
+	ConnCacheEntry *entry;
+
+	entry = get_connection_entry(user->umid);
+	if (entry->storage == NULL)
+	{
+		entry->storage = MemoryContextAlloc(CacheMemoryContext, initsize);
+		memset(entry->storage, 0, initsize);
+	}
+
+	return entry->storage;
+}
+
+/*
  * Connect to remote server using specified server and user mapping properties.
  */
 static PGconn *
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index d97e694..292e59c 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -5561,27 +5561,33 @@ delete from foo where f1 < 5 returning *;
 (5 rows)
 
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-                                  QUERY PLAN                                  
-------------------------------------------------------------------------------
- Update on public.bar
-   Output: bar.f1, bar.f2
-   Update on public.bar
-   Foreign Update on public.bar2
-   ->  Seq Scan on public.bar
-         Output: bar.f1, (bar.f2 + 100), bar.ctid
-   ->  Foreign Update on public.bar2
-         Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
-(8 rows)
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+                                      QUERY PLAN                                      
+--------------------------------------------------------------------------------------
+ Sort
+   Output: u.f1, u.f2
+   Sort Key: u.f1
+   CTE u
+     ->  Update on public.bar
+           Output: bar.f1, bar.f2
+           Update on public.bar
+           Foreign Update on public.bar2
+           ->  Seq Scan on public.bar
+                 Output: bar.f1, (bar.f2 + 100), bar.ctid
+           ->  Foreign Update on public.bar2
+                 Remote SQL: UPDATE public.loct2 SET f2 = (f2 + 100) RETURNING f1, f2
+   ->  CTE Scan on u
+         Output: u.f1, u.f2
+(14 rows)
 
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
  f1 | f2  
 ----+-----
   1 | 311
   2 | 322
-  6 | 266
   3 | 333
   4 | 344
+  6 | 266
   7 | 277
 (6 rows)
 
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index daf0438..eb02a73 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -19,6 +19,7 @@
 #include "commands/defrem.h"
 #include "commands/explain.h"
 #include "commands/vacuum.h"
+#include "executor/execAsync.h"
 #include "foreign/fdwapi.h"
 #include "funcapi.h"
 #include "miscadmin.h"
@@ -50,6 +51,8 @@ PG_MODULE_MAGIC;
 /* If no remote estimates, assume a sort costs 20% extra */
 #define DEFAULT_FDW_SORT_MULTIPLIER 1.2
 
+/* Retrieve PgFdwScanState struct from ForeignScanState */
+#define GetPgFdwScanState(n) ((PgFdwScanState *)(n)->fdw_state)
 /*
  * Indexes of FDW-private information stored in fdw_private lists.
  *
@@ -119,10 +122,27 @@ enum FdwDirectModifyPrivateIndex
 };
 
 /*
+ * Connection private area structure.
+ */
+typedef struct PgFdwConnspecate
+{
+	ForeignScanState *current_owner;	/* The node currently running a query
+										 * on this connection */
+} PgFdwConnspecate;
+
+/* Execution state base type */
+typedef struct PgFdwState
+{
+	PGconn	   *conn;			/* connection for the scan */
+	PgFdwConnspecate *connspec;	/* connection private memory */
+} PgFdwState;
+
+/*
  * Execution state of a foreign scan using postgres_fdw.
  */
 typedef struct PgFdwScanState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table. NULL
 								 * for a foreign join scan. */
 	TupleDesc	tupdesc;		/* tuple descriptor of scan */
@@ -133,7 +153,6 @@ typedef struct PgFdwScanState
 	List	   *retrieved_attrs;	/* list of retrieved attribute numbers */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	unsigned int cursor_number; /* quasi-unique ID for my cursor */
 	bool		cursor_exists;	/* have we created the cursor? */
 	int			numParams;		/* number of parameters passed to query */
@@ -149,6 +168,12 @@ typedef struct PgFdwScanState
 	/* batch-level state, for optimizing rewinds and avoiding useless fetch */
 	int			fetch_ct_2;		/* Min(# of fetches done, 2) */
 	bool		eof_reached;	/* true if last fetch reached EOF */
+	bool		async_waiting;	/* true if requesting the parent to wait */
+	ForeignScanState *waiter;	/* Next node to run a query among nodes
+								 * sharing the same connection */
+	ForeignScanState *last_waiter;	/* A waiting node at the end of a waiting
+									 * list. Maintained only by the current
+									 * owner of the connection */
 
 	/* working memory contexts */
 	MemoryContext batch_cxt;	/* context holding current batch of tuples */
@@ -162,11 +187,11 @@ typedef struct PgFdwScanState
  */
 typedef struct PgFdwModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the scan */
 	char	   *p_name;			/* name of prepared statement, if created */
 
 	/* extracted fdw_private data */
@@ -189,6 +214,7 @@ typedef struct PgFdwModifyState
  */
 typedef struct PgFdwDirectModifyState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 
@@ -199,7 +225,6 @@ typedef struct PgFdwDirectModifyState
 	bool		set_processed;	/* do we set the command es_processed? */
 
 	/* for remote query execution */
-	PGconn	   *conn;			/* connection for the update */
 	int			numParams;		/* number of parameters passed to query */
 	FmgrInfo   *param_flinfo;	/* output conversion functions for them */
 	List	   *param_exprs;	/* executable expressions for param values */
@@ -219,6 +244,7 @@ typedef struct PgFdwDirectModifyState
  */
 typedef struct PgFdwAnalyzeState
 {
+	PgFdwState	s;				/* common structure */
 	Relation	rel;			/* relcache entry for the foreign table */
 	AttInMetadata *attinmeta;	/* attribute datatype conversion metadata */
 	List	   *retrieved_attrs;	/* attr numbers retrieved by query */
@@ -287,6 +313,7 @@ static void postgresBeginForeignScan(ForeignScanState *node, int eflags);
 static TupleTableSlot *postgresIterateForeignScan(ForeignScanState *node);
 static void postgresReScanForeignScan(ForeignScanState *node);
 static void postgresEndForeignScan(ForeignScanState *node);
+static void postgresShutdownForeignScan(ForeignScanState *node);
 static void postgresAddForeignUpdateTargets(Query *parsetree,
 								RangeTblEntry *target_rte,
 								Relation target_relation);
@@ -343,6 +370,8 @@ static void postgresGetForeignJoinPaths(PlannerInfo *root,
 							JoinPathExtraData *extra);
 static bool postgresRecheckForeignScan(ForeignScanState *node,
 						   TupleTableSlot *slot);
+static bool postgresAsyncConfigureWait(ForeignScanState *node,
+									   AsyncConfigMode mode);
 
 /*
  * Helper functions
@@ -363,7 +392,10 @@ static bool ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 						  EquivalenceClass *ec, EquivalenceMember *em,
 						  void *arg);
 static void create_cursor(ForeignScanState *node);
-static void fetch_more_data(ForeignScanState *node);
+static void request_more_data(ForeignScanState *node);
+static void fetch_received_data(ForeignScanState *node);
+static void vacate_connection(PgFdwState *fdwstate);
+static void absorb_current_result(ForeignScanState *node);
 static void close_cursor(PGconn *conn, unsigned int cursor_number);
 static void prepare_foreign_modify(PgFdwModifyState *fmstate);
 static const char **convert_prep_stmt_params(PgFdwModifyState *fmstate,
@@ -424,6 +456,7 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	routine->IterateForeignScan = postgresIterateForeignScan;
 	routine->ReScanForeignScan = postgresReScanForeignScan;
 	routine->EndForeignScan = postgresEndForeignScan;
+	routine->ShutdownForeignScan = postgresShutdownForeignScan;
 
 	/* Functions for updating foreign tables */
 	routine->AddForeignUpdateTargets = postgresAddForeignUpdateTargets;
@@ -455,6 +488,9 @@ postgres_fdw_handler(PG_FUNCTION_ARGS)
 	/* Support functions for join push-down */
 	routine->GetForeignJoinPaths = postgresGetForeignJoinPaths;
 
+	/* Support functions for asynchronous execution */
+	routine->AsyncConfigureWait = postgresAsyncConfigureWait;
+
 	PG_RETURN_POINTER(routine);
 }
 
@@ -1298,12 +1334,20 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	fsstate->conn = GetConnection(user, false);
+	fsstate->s.conn = GetConnection(user, false);
+	fsstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
+	fsstate->s.connspec->current_owner = NULL;
+	fsstate->waiter = NULL;
+	fsstate->last_waiter = node;
 
 	/* Assign a unique ID for my cursor */
-	fsstate->cursor_number = GetCursorNumber(fsstate->conn);
+	fsstate->cursor_number = GetCursorNumber(fsstate->s.conn);
 	fsstate->cursor_exists = false;
 
+	/* Initialize async execution status */
+	fsstate->async_waiting = false;
+
 	/* Get private info created by planner functions. */
 	fsstate->query = strVal(list_nth(fsplan->fdw_private,
 									 FdwScanPrivateSelectSql));
@@ -1359,27 +1403,122 @@ postgresBeginForeignScan(ForeignScanState *node, int eflags)
 static TupleTableSlot *
 postgresIterateForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
 
 	/*
-	 * If this is the first call after Begin or ReScan, we need to create the
-	 * cursor on the remote side.
-	 */
-	if (!fsstate->cursor_exists)
-		create_cursor(node);
-
-	/*
 	 * Get some more tuples, if we've run out.
 	 */
 	if (fsstate->next_tuple >= fsstate->num_tuples)
 	{
-		/* No point in another fetch if we already detected EOF, though. */
-		if (!fsstate->eof_reached)
-			fetch_more_data(node);
-		/* If we didn't get any tuples, must be end of data. */
+		ForeignScanState *next_conn_owner = node;
+
+		/* This node has sent a query on this connection */
+		if (fsstate->s.connspec->current_owner == node)
+		{
+			/* Check if the result is available */
+			if (PQisBusy(fsstate->s.conn))
+			{
+				int rc = WaitLatchOrSocket(NULL,
+										   WL_SOCKET_READABLE | WL_TIMEOUT,
+										   PQsocket(fsstate->s.conn), 0);
+				if (!(rc & WL_SOCKET_READABLE))
+				{
+					/*
+					 * This node is not ready yet. Tell the caller to wait.
+					 */
+					node->ss.ps.result_ready = false;
+					return ExecClearTuple(slot);
+				}
+			}
+
+			Assert(fsstate->async_waiting);
+
+			ExecAsyncDoesNotNeedWait((PlanState *) node);
+			fsstate->async_waiting = false;
+
+			fetch_received_data(node);
+
+			/*
+			 * If someone is waiting for this node on the same connection,
+			 * let the first waiter become the next owner of the connection.
+			 */
+			if (fsstate->waiter)
+			{
+				PgFdwScanState *next_owner_state;
+
+				next_conn_owner = fsstate->waiter;
+				next_owner_state = GetPgFdwScanState(next_conn_owner);
+				fsstate->waiter = NULL;
+
+				/*
+				 * Only the current owner is responsible for maintaining the
+				 * shortcut to the last waiter.
+				 */
+				next_owner_state->last_waiter = fsstate->last_waiter;
+
+				/*
+				 * For simplicity, last_waiter points to the node itself when
+				 * no one is waiting for it.
+				 */
+				fsstate->last_waiter = node;
+			}
+		}
+		else if (fsstate->s.connspec->current_owner)
+		{
+			/*
+			 * Someone else owns this connection. Add myself to the tail of
+			 * the waiters' list, then return not-ready.  To avoid scanning
+			 * through the waiters' list, the current owner maintains a
+			 * shortcut to the last waiter.
+			 */
+			PgFdwScanState *conn_owner_state =
+				GetPgFdwScanState(fsstate->s.connspec->current_owner);
+			ForeignScanState *last_waiter = conn_owner_state->last_waiter;
+			PgFdwScanState *last_waiter_state = GetPgFdwScanState(last_waiter);
+
+			last_waiter_state->waiter = node;
+			conn_owner_state->last_waiter = node;
+
+			/* Register the node to the async-waiting node list */
+			Assert(!GetPgFdwScanState(node)->async_waiting);
+
+			ExecAsyncNeedsWait((PlanState *) node, 1, false);
+			GetPgFdwScanState(node)->async_waiting = true;
+
+			node->ss.ps.result_ready = false;
+			return ExecClearTuple(slot);
+		}
+
+		/*
+		 * Send the next request for the next owner of this connection if
+		 * needed.
+		 */
+
+		if (!GetPgFdwScanState(next_conn_owner)->eof_reached)
+		{
+			PgFdwScanState *next_owner_state =
+				GetPgFdwScanState(next_conn_owner);
+
+			request_more_data(next_conn_owner);
+
+			/* Register the node to the async-waiting node list */
+			if (!next_owner_state->async_waiting)
+			{
+				ExecAsyncNeedsWait((PlanState *) next_conn_owner, 1, false);
+				next_owner_state->async_waiting = true;
+			}
+		}
+
+		/*
+		 * If we haven't received a result for the given node this time,
+		 * return with no tuple to give way to other nodes.
+		 */
 		if (fsstate->next_tuple >= fsstate->num_tuples)
+		{
+			node->ss.ps.result_ready = fsstate->eof_reached;
 			return ExecClearTuple(slot);
+		}
 	}
 
 	/*
@@ -1393,6 +1532,73 @@ postgresIterateForeignScan(ForeignScanState *node)
 	return slot;
 }
 
+
+static bool
+postgresAsyncConfigureWait(ForeignScanState *node, AsyncConfigMode mode)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	EState *estate = node->ss.ps.state;
+
+	if ((mode == ASYNCCONF_TRY_ADD || mode == ASYNCCONF_FORCE_ADD) &&
+		fsstate->s.connspec->current_owner == node)
+	{
+		AddWaitEventToSet(estate->wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, node);
+		return true;
+	}
+
+	if (mode == ASYNCCONF_FORCE_ADD && fsstate->s.connspec->current_owner)
+	{
+		/*
+		 * We must set a wait event somehow. This occurs when the connection
+		 * owner does not reside in the current waiters' list. In that case,
+		 * forcibly make the connection owner finish its current request and
+		 * usurp the connection.
+		 */
+		ForeignScanState *owner = fsstate->s.connspec->current_owner;
+		PgFdwScanState *owner_state = GetPgFdwScanState(owner);
+		ForeignScanState *prev_waiter, *node_tmp;
+
+		fetch_received_data(owner);
+
+		/* find myself in the waiters' list */
+		prev_waiter = owner;
+
+		while (GetPgFdwScanState(prev_waiter)->waiter != node)
+			prev_waiter = GetPgFdwScanState(prev_waiter)->waiter;
+
+		/* Swap the previous owner and this node in the waiters' list */
+		if (owner_state->waiter == node)
+			node_tmp = owner;
+		else
+		{
+			node_tmp = owner_state->waiter;
+			GetPgFdwScanState(prev_waiter)->waiter = owner;
+		}
+
+		owner_state->waiter = fsstate->waiter;
+		fsstate->waiter = node_tmp;
+
+		if (owner_state->last_waiter == node)
+			fsstate->last_waiter = prev_waiter;
+		else
+			fsstate->last_waiter = owner_state->last_waiter;
+
+		request_more_data(node);
+
+		/* now I am the connection owner */
+		AddWaitEventToSet(estate->wait_event_set,
+						  WL_SOCKET_READABLE, PQsocket(fsstate->s.conn),
+						  NULL, node);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * postgresReScanForeignScan
  *		Restart the scan.
@@ -1400,7 +1606,7 @@ postgresIterateForeignScan(ForeignScanState *node)
 static void
 postgresReScanForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	char		sql[64];
 	PGresult   *res;
 
@@ -1408,6 +1614,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	if (!fsstate->cursor_exists)
 		return;
 
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+
 	/*
 	 * If any internal parameters affecting this node have changed, we'd
 	 * better destroy and recreate the cursor.  Otherwise, rewinding it should
@@ -1436,9 +1645,9 @@ postgresReScanForeignScan(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_exec_query(fsstate->conn, sql);
+	res = pgfdw_exec_query(fsstate->s.conn, sql);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fsstate->conn, true, sql);
+		pgfdw_report_error(ERROR, res, fsstate->s.conn, true, sql);
 	PQclear(res);
 
 	/* Now force a fresh FETCH. */
@@ -1456,7 +1665,7 @@ postgresReScanForeignScan(ForeignScanState *node)
 static void
 postgresEndForeignScan(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 
 	/* if fsstate is NULL, we are in EXPLAIN; nothing to do */
 	if (fsstate == NULL)
@@ -1464,16 +1673,39 @@ postgresEndForeignScan(ForeignScanState *node)
 
 	/* Close the cursor if open, to prevent accumulation of cursors */
 	if (fsstate->cursor_exists)
-		close_cursor(fsstate->conn, fsstate->cursor_number);
+		close_cursor(fsstate->s.conn, fsstate->cursor_number);
 
 	/* Release remote connection */
-	ReleaseConnection(fsstate->conn);
-	fsstate->conn = NULL;
+	ReleaseConnection(fsstate->s.conn);
+	fsstate->s.conn = NULL;
 
 	/* MemoryContexts will be deleted automatically. */
 }
 
 /*
+ * postgresShutdownForeignScan
+ *		Tear down async state and discard any remaining result on the connection.
+ */
+static void
+postgresShutdownForeignScan(ForeignScanState *node)
+{
+	ForeignScan *plan = (ForeignScan *) node->ss.ps.plan;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+
+	if (plan->operation != CMD_SELECT)
+		return;
+
+	if (fsstate->async_waiting)
+	{
+		ExecAsyncDoesNotNeedWait((PlanState *) node);
+		fsstate->async_waiting = false;
+	}
+
+	/* Absorb the remaining result */
+	absorb_current_result(node);
+}
+
+/*
  * postgresAddForeignUpdateTargets
  *		Add resjunk column(s) needed for update/delete on a foreign table
  */
@@ -1675,7 +1907,9 @@ postgresBeginForeignModify(ModifyTableState *mtstate,
 	user = GetUserMapping(userid, table->serverid);
 
 	/* Open connection; report that we'll create a prepared statement. */
-	fmstate->conn = GetConnection(user, true);
+	fmstate->s.conn = GetConnection(user, true);
+	fmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 	fmstate->p_name = NULL;		/* prepared statement not made yet */
 
 	/* Deconstruct fdw_private data. */
@@ -1754,6 +1988,8 @@ postgresExecForeignInsert(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1764,14 +2000,15 @@ postgresExecForeignInsert(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1779,10 +2016,10 @@ postgresExecForeignInsert(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1820,6 +2057,8 @@ postgresExecForeignUpdate(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1840,14 +2079,15 @@ postgresExecForeignUpdate(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1855,10 +2095,10 @@ postgresExecForeignUpdate(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1896,6 +2136,8 @@ postgresExecForeignDelete(EState *estate,
 	PGresult   *res;
 	int			n_rows;
 
+	vacate_connection((PgFdwState *)fmstate);
+
 	/* Set up the prepared statement on the remote server, if we didn't yet */
 	if (!fmstate->p_name)
 		prepare_foreign_modify(fmstate);
@@ -1916,14 +2158,15 @@ postgresExecForeignDelete(EState *estate,
 	/*
 	 * Execute the prepared statement.
 	 */
-	if (!PQsendQueryPrepared(fmstate->conn,
+	if (!PQsendQueryPrepared(fmstate->s.conn,
 							 fmstate->p_name,
 							 fmstate->p_nums,
 							 p_values,
 							 NULL,
 							 NULL,
 							 0))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -1931,10 +2174,10 @@ postgresExecForeignDelete(EState *estate,
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) !=
 		(fmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 
 	/* Check number of rows affected, and fetch RETURNING tuple if any */
 	if (fmstate->has_returning)
@@ -1981,16 +2224,16 @@ postgresEndForeignModify(EState *estate,
 		 * We don't use a PG_TRY block here, so be careful not to throw error
 		 * without releasing the PGresult.
 		 */
-		res = pgfdw_exec_query(fmstate->conn, sql);
+		res = pgfdw_exec_query(fmstate->s.conn, sql);
 		if (PQresultStatus(res) != PGRES_COMMAND_OK)
-			pgfdw_report_error(ERROR, res, fmstate->conn, true, sql);
+			pgfdw_report_error(ERROR, res, fmstate->s.conn, true, sql);
 		PQclear(res);
 		fmstate->p_name = NULL;
 	}
 
 	/* Release remote connection */
-	ReleaseConnection(fmstate->conn);
-	fmstate->conn = NULL;
+	ReleaseConnection(fmstate->s.conn);
+	fmstate->s.conn = NULL;
 }
 
 /*
@@ -2270,7 +2513,9 @@ postgresBeginDirectModify(ForeignScanState *node, int eflags)
 	 * Get connection to the foreign server.  Connection manager will
 	 * establish new connection if necessary.
 	 */
-	dmstate->conn = GetConnection(user, false);
+	dmstate->s.conn = GetConnection(user, false);
+	dmstate->s.connspec = (PgFdwConnspecate *)
+		GetConnectionSpecificStorage(user, sizeof(PgFdwConnspecate));
 
 	/* Initialize state variable */
 	dmstate->num_tuples = -1;	/* -1 means not set yet */
@@ -2323,7 +2568,10 @@ postgresIterateDirectModify(ForeignScanState *node)
 	 * If this is the first call after Begin, execute the statement.
 	 */
 	if (dmstate->num_tuples == -1)
+	{
+		vacate_connection((PgFdwState *)dmstate);
 		execute_dml_stmt(node);
+	}
 
 	/*
 	 * If the local query doesn't specify RETURNING, just clear tuple slot.
@@ -2370,8 +2618,8 @@ postgresEndDirectModify(ForeignScanState *node)
 		PQclear(dmstate->result);
 
 	/* Release remote connection */
-	ReleaseConnection(dmstate->conn);
-	dmstate->conn = NULL;
+	ReleaseConnection(dmstate->s.conn);
+	dmstate->s.conn = NULL;
 
 	/* MemoryContext will be deleted automatically. */
 }
@@ -2489,6 +2737,7 @@ estimate_path_cost_size(PlannerInfo *root,
 		List	   *local_param_join_conds;
 		StringInfoData sql;
 		PGconn	   *conn;
+		PgFdwConnspecate *connspec;
 		Selectivity local_sel;
 		QualCost	local_cost;
 		List	   *fdw_scan_tlist = NIL;
@@ -2531,6 +2780,16 @@ estimate_path_cost_size(PlannerInfo *root,
 
 		/* Get the remote estimate */
 		conn = GetConnection(fpinfo->user, false);
+		connspec = GetConnectionSpecificStorage(fpinfo->user,
+												sizeof(PgFdwConnspecate));
+		if (connspec)
+		{
+			PgFdwState tmpstate;
+			tmpstate.conn = conn;
+			tmpstate.connspec = connspec;
+			vacate_connection(&tmpstate);
+		}
+		
 		get_remote_estimate(sql.data, conn, &rows, &width,
 							&startup_cost, &total_cost);
 		ReleaseConnection(conn);
@@ -2810,11 +3069,11 @@ ec_member_matches_foreign(PlannerInfo *root, RelOptInfo *rel,
 static void
 create_cursor(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	ExprContext *econtext = node->ss.ps.ps_ExprContext;
 	int			numParams = fsstate->numParams;
 	const char **values = fsstate->param_values;
-	PGconn	   *conn = fsstate->conn;
+	PGconn	   *conn = fsstate->s.conn;
 	StringInfoData buf;
 	PGresult   *res;
 
@@ -2880,47 +3139,96 @@ create_cursor(ForeignScanState *node)
  * Fetch some more rows from the node's cursor.
  */
 static void
-fetch_more_data(ForeignScanState *node)
+request_more_data(ForeignScanState *node)
 {
-	PgFdwScanState *fsstate = (PgFdwScanState *) node->fdw_state;
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	PGconn	   *conn = fsstate->s.conn;
+	char		sql[64];
+
+	/* The connection should be vacant */
+	Assert(fsstate->s.connspec->current_owner == NULL);
+
+	/*
+	 * If this is the first call after Begin or ReScan, we need to create the
+	 * cursor on the remote side.
+	 */
+	if (!fsstate->cursor_exists)
+		create_cursor(node);
+
+	snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
+			 fsstate->fetch_size, fsstate->cursor_number);
+
+	if (!PQsendQuery(conn, sql))
+		pgfdw_report_error(ERROR, NULL, conn, false, sql);
+
+	fsstate->s.connspec->current_owner = node;
+}
+
+/*
+ * Receive the result of the FETCH request previously sent on this node's
+ * connection, and store the returned rows.
+ */
+static void
+fetch_received_data(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
 	PGresult   *volatile res = NULL;
 	MemoryContext oldcontext;
 
+	/* I should be the current connection owner */
+	Assert(fsstate->s.connspec->current_owner == node);
+
 	/*
 	 * We'll store the tuples in the batch_cxt.  First, flush the previous
-	 * batch.
+	 * batch if no tuples remain.
 	 */
-	fsstate->tuples = NULL;
-	MemoryContextReset(fsstate->batch_cxt);
+	if (fsstate->next_tuple >= fsstate->num_tuples)
+	{
+		fsstate->tuples = NULL;
+		fsstate->num_tuples = 0;
+		MemoryContextReset(fsstate->batch_cxt);
+	}
+	else if (fsstate->next_tuple > 0)
+	{
+		/* move the remaining tuples to the beginning of the store */
+		int n = 0;
+
+		while (fsstate->next_tuple < fsstate->num_tuples)
+			fsstate->tuples[n++] = fsstate->tuples[fsstate->next_tuple++];
+		fsstate->num_tuples = n;
+	}
+
 	oldcontext = MemoryContextSwitchTo(fsstate->batch_cxt);
 
 	/* PGresult must be released before leaving this function. */
 	PG_TRY();
 	{
-		PGconn	   *conn = fsstate->conn;
+		PGconn	   *conn = fsstate->s.conn;
 		char		sql[64];
-		int			numrows;
+		int			addrows;
+		size_t		newsize;
 		int			i;
 
 		snprintf(sql, sizeof(sql), "FETCH %d FROM c%u",
 				 fsstate->fetch_size, fsstate->cursor_number);
 
-		res = pgfdw_exec_query(conn, sql);
+		res = pgfdw_get_result(conn, sql);
 		/* On error, report the original query, not the FETCH. */
 		if (PQresultStatus(res) != PGRES_TUPLES_OK)
 			pgfdw_report_error(ERROR, res, conn, false, fsstate->query);
 
 		/* Convert the data into HeapTuples */
-		numrows = PQntuples(res);
-		fsstate->tuples = (HeapTuple *) palloc0(numrows * sizeof(HeapTuple));
-		fsstate->num_tuples = numrows;
-		fsstate->next_tuple = 0;
+		addrows = PQntuples(res);
+		newsize = (fsstate->num_tuples + addrows) * sizeof(HeapTuple);
+		if (fsstate->tuples)
+			fsstate->tuples = (HeapTuple *) repalloc(fsstate->tuples, newsize);
+		else
+			fsstate->tuples = (HeapTuple *) palloc(newsize);
 
-		for (i = 0; i < numrows; i++)
+		for (i = 0; i < addrows; i++)
 		{
 			Assert(IsA(node->ss.ps.plan, ForeignScan));
 
-			fsstate->tuples[i] =
+			fsstate->tuples[fsstate->num_tuples + i] =
 				make_tuple_from_result_row(res, i,
 										   fsstate->rel,
 										   fsstate->attinmeta,
@@ -2930,26 +3238,81 @@ fetch_more_data(ForeignScanState *node)
 		}
 
 		/* Update fetch_ct_2 */
-		if (fsstate->fetch_ct_2 < 2)
+		if (fsstate->fetch_ct_2 < 2 && fsstate->next_tuple == 0)
 			fsstate->fetch_ct_2++;
 
+		fsstate->next_tuple = 0;
+		fsstate->num_tuples += addrows;
+
 		/* Must be EOF if we didn't get as many tuples as we asked for. */
-		fsstate->eof_reached = (numrows < fsstate->fetch_size);
+		fsstate->eof_reached = (addrows < fsstate->fetch_size);
 
 		PQclear(res);
 		res = NULL;
 	}
 	PG_CATCH();
 	{
+		fsstate->s.connspec->current_owner = NULL;
 		if (res)
 			PQclear(res);
 		PG_RE_THROW();
 	}
 	PG_END_TRY();
 
+	fsstate->s.connspec->current_owner = NULL;
+
 	MemoryContextSwitchTo(oldcontext);
 }
 
+/*
+ * Vacate a connection so that this node can send the next query
+ */
+static void
+vacate_connection(PgFdwState *fdwstate)
+{
+	PgFdwConnspecate *connspec = fdwstate->connspec;
+	ForeignScanState *owner;
+
+	if (connspec == NULL || connspec->current_owner == NULL)
+		return;
+
+	/*
+	 * Let the current connection owner read the result of the running query.
+	 */
+	owner = connspec->current_owner;
+	fetch_received_data(owner);
+
+	/* Clear the waiting list */
+	while (owner)
+	{
+		PgFdwScanState *fsstate = GetPgFdwScanState(owner);
+
+		fsstate->last_waiter = NULL;
+		owner = fsstate->waiter;
+		fsstate->waiter = NULL;
+	}
+}
+
+/*
+ * Absorb the result of the current query.
+ */
+static void
+absorb_current_result(ForeignScanState *node)
+{
+	PgFdwScanState *fsstate = GetPgFdwScanState(node);
+	ForeignScanState *owner = fsstate->s.connspec->current_owner;
+
+	Assert(!fsstate->async_waiting);
+	if (owner)
+	{
+		PgFdwScanState *target_state = GetPgFdwScanState(owner);
+		PGconn *conn = target_state->s.conn;
+
+		while (PQisBusy(conn))
+			PQclear(PQgetResult(conn));
+		fsstate->s.connspec->current_owner = NULL;
+	}
+}
+
 /*
  * Force assorted GUC parameters to settings that ensure that we'll output
  * data values in a form that is unambiguous to the remote server.
@@ -3034,7 +3397,7 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 
 	/* Construct name we'll use for the prepared statement. */
 	snprintf(prep_name, sizeof(prep_name), "pgsql_fdw_prep_%u",
-			 GetPrepStmtNumber(fmstate->conn));
+			 GetPrepStmtNumber(fmstate->s.conn));
 	p_name = pstrdup(prep_name);
 
 	/*
@@ -3044,12 +3407,13 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * the prepared statements we use in this module are simple enough that
 	 * the remote server will make the right choices.
 	 */
-	if (!PQsendPrepare(fmstate->conn,
+	if (!PQsendPrepare(fmstate->s.conn,
 					   p_name,
 					   fmstate->query,
 					   0,
 					   NULL))
-		pgfdw_report_error(ERROR, NULL, fmstate->conn, false, fmstate->query);
+		pgfdw_report_error(ERROR, NULL, fmstate->s.conn,
+						   false, fmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3057,9 +3421,9 @@ prepare_foreign_modify(PgFdwModifyState *fmstate)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	res = pgfdw_get_result(fmstate->conn, fmstate->query);
+	res = pgfdw_get_result(fmstate->s.conn, fmstate->query);
 	if (PQresultStatus(res) != PGRES_COMMAND_OK)
-		pgfdw_report_error(ERROR, res, fmstate->conn, true, fmstate->query);
+		pgfdw_report_error(ERROR, res, fmstate->s.conn, true, fmstate->query);
 	PQclear(res);
 
 	/* This action shows that the prepare has been done. */
@@ -3190,9 +3554,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * the desired result.  This allows us to avoid assuming that the remote
 	 * server has the same OIDs we do for the parameters' types.
 	 */
-	if (!PQsendQueryParams(dmstate->conn, dmstate->query, numParams,
+	if (!PQsendQueryParams(dmstate->s.conn, dmstate->query, numParams,
 						   NULL, values, NULL, NULL, 0))
-		pgfdw_report_error(ERROR, NULL, dmstate->conn, false, dmstate->query);
+		pgfdw_report_error(ERROR, NULL, dmstate->s.conn,
+						   false, dmstate->query);
 
 	/*
 	 * Get the result, and check for success.
@@ -3200,10 +3565,10 @@ execute_dml_stmt(ForeignScanState *node)
 	 * We don't use a PG_TRY block here, so be careful not to throw error
 	 * without releasing the PGresult.
 	 */
-	dmstate->result = pgfdw_get_result(dmstate->conn, dmstate->query);
+	dmstate->result = pgfdw_get_result(dmstate->s.conn, dmstate->query);
 	if (PQresultStatus(dmstate->result) !=
 		(dmstate->has_returning ? PGRES_TUPLES_OK : PGRES_COMMAND_OK))
-		pgfdw_report_error(ERROR, dmstate->result, dmstate->conn, true,
+		pgfdw_report_error(ERROR, dmstate->result, dmstate->s.conn, true,
 						   dmstate->query);
 
 	/* Get the number of rows affected. */
@@ -4387,7 +4752,7 @@ make_tuple_from_result_row(PGresult *res,
 		PgFdwScanState *fdw_sstate;
 
 		Assert(fsstate);
-		fdw_sstate = (PgFdwScanState *) fsstate->fdw_state;
+		fdw_sstate = GetPgFdwScanState(fsstate);
 		tupdesc = fdw_sstate->tupdesc;
 	}
 
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index 67126bc..b0c1266 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -79,7 +79,8 @@ typedef struct PgFdwRelationInfo
 	UserMapping *user;			/* only set in use_remote_estimate mode */
 
 	int			fetch_size;		/* fetch size for this remote table */
-
+	bool		allow_prefetch;	/* true to allow overlapped fetching */
+
 	/*
 	 * Name of the relation while EXPLAINing ForeignScan. It is used for join
 	 * relations but is set for all relations. For join relation, the name
@@ -100,6 +101,7 @@ extern void reset_transmission_modes(int nestlevel);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
+extern void *GetConnectionSpecificStorage(UserMapping *user, size_t initsize);
 extern void ReleaseConnection(PGconn *conn);
 extern unsigned int GetCursorNumber(PGconn *conn);
 extern unsigned int GetPrepStmtNumber(PGconn *conn);
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4f68e89..de1d96e 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -1248,8 +1248,8 @@ explain (verbose, costs off)
 delete from foo where f1 < 5 returning *;
 delete from foo where f1 < 5 returning *;
 explain (verbose, costs off)
-update bar set f2 = f2 + 100 returning *;
-update bar set f2 = f2 + 100 returning *;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
+with u as (update bar set f2 = f2 + 100 returning *) select * from u order by 1;
 
 drop table foo cascade;
 drop table bar cascade;
-- 
2.9.2
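
The connection-sharing logic in postgresIterateForeignScan above boils down
to a per-connection FIFO of scan nodes, linked through each node's "waiter"
field, with only the current owner maintaining the "last_waiter" shortcut to
the tail.  Here is a minimal, self-contained sketch of that queue discipline
(an illustration only, not code from the patch; the mock ScanNode stands in
for PgFdwScanState):

    #include <assert.h>
    #include <stddef.h>

    /* Mock of the per-scan state, reduced to the two queue links. */
    typedef struct ScanNode
    {
        struct ScanNode *waiter;        /* next node waiting for the conn */
        struct ScanNode *last_waiter;   /* tail shortcut; valid on owner */
    } ScanNode;

    /* Called when 'node' finds the connection busy, owned by 'owner'. */
    static void
    enqueue_waiter(ScanNode *owner, ScanNode *node)
    {
        owner->last_waiter->waiter = node;  /* append after current tail */
        owner->last_waiter = node;          /* owner tracks the new tail */
    }

    /* The owner finished its fetch; hand ownership to the first waiter. */
    static ScanNode *
    pass_ownership(ScanNode *owner)
    {
        ScanNode *next = owner->waiter;

        if (next != NULL)
        {
            next->last_waiter = owner->last_waiter; /* tail shortcut moves */
            owner->waiter = NULL;
            owner->last_waiter = owner;   /* an idle node points at itself */
        }
        return next;
    }

    int
    main(void)
    {
        ScanNode a = {NULL, &a}, b = {NULL, &b}, c = {NULL, &c};

        enqueue_waiter(&a, &b);    /* a owns the connection; b then c wait */
        enqueue_waiter(&a, &c);
        assert(pass_ownership(&a) == &b);
        assert(pass_ownership(&b) == &c);
        assert(pass_ownership(&c) == NULL);
        return 0;
    }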

0007-Make-Append-node-async-aware.patch (text/x-patch; charset=us-ascii)
From fad8ffd31b9ea3551749db0e87b7a1f48732217b Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Tue, 28 Jun 2016 18:52:37 +0900
Subject: [PATCH 7/7] Make Append node async-aware.

Change the Append node to be capable of handling asynchronous children
properly. As soon as it receives !async_ready from a child, it moves
to the next child, and if no child is ready, it sleeps until at least
one of them becomes ready.
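
In outline, the new ExecAppend behaves like the following toy model (a
simplified, self-contained illustration of the round-robin policy, not
executor code; toy_poll stands in for a child that is only intermittently
ready):

    #include <stdbool.h>
    #include <stdio.h>

    #define NCHILD 3

    typedef struct ToyChild
    {
        int  remaining;   /* tuples this child still has to return */
        bool finished;
    } ToyChild;

    /* Return true and set *tuple if the child was ready this round. */
    static bool
    toy_poll(ToyChild *c, int round, int *tuple)
    {
        if (c->remaining > 0 && round % 2 == 0)  /* ready half the time */
        {
            *tuple = c->remaining--;
            return true;
        }
        return false;
    }

    int
    main(void)
    {
        ToyChild children[NCHILD] = {{2, false}, {3, false}, {1, false}};
        int      which = 0, live = NCHILD, round = 0;

        while (live > 0)
        {
            ToyChild *c = &children[which];
            int       tuple;

            /* Poll the current child; if it is not ready, just move on. */
            if (!c->finished && toy_poll(c, round++, &tuple))
                printf("tuple %d from child %d\n", tuple, which);
            if (!c->finished && c->remaining == 0)
            {
                c->finished = true;   /* exhausted: skip it from now on */
                live--;
            }

            /*
             * Round-robin to the next child.  The real node instead
             * returns not-ready to its parent when no child is ready,
             * or waits via ExecAsyncWaitForNode when async is disallowed.
             */
            which = (which + 1) % NCHILD;
        }
        return 0;
    }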
---
 src/backend/executor/nodeAppend.c | 67 +++++++++++++++++++++++++++++++++++++--
 src/include/nodes/execnodes.h     |  2 ++
 2 files changed, 67 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index e0ce8c6..9a4063a 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -58,6 +58,7 @@
 #include "postgres.h"
 
 #include "executor/execdebug.h"
+#include "executor/execAsync.h"
 #include "executor/nodeAppend.h"
 
 static bool exec_append_initialize_next(AppendState *appendstate);
@@ -121,6 +122,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 {
 	AppendState *appendstate = makeNode(AppendState);
 	PlanState **appendplanstates;
+	bool	   *finished;
 	int			nplans;
 	int			i;
 	ListCell   *lc;
@@ -134,14 +136,17 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	nplans = list_length(node->appendplans);
 
 	appendplanstates = (PlanState **) palloc0(nplans * sizeof(PlanState *));
-
+	finished = (bool *) palloc0(nplans * sizeof(bool));
+
 	/*
 	 * create new AppendState for our append node
 	 */
 	appendstate->ps.plan = (Plan *) node;
 	appendstate->ps.state = estate;
 	appendstate->appendplans = appendplanstates;
+	appendstate->as_finished = finished;
 	appendstate->as_nplans = nplans;
+	appendstate->as_async = ((eflags & EXEC_FLAG_BACKWARD) == 0);
 
 	/*
 	 * Miscellaneous initialization
@@ -194,6 +199,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 void
 ExecAppend(AppendState *node)
 {
+	int stopplan = node->as_whichplan;
+
 	for (;;)
 	{
 		PlanState  *subnode;
@@ -207,7 +214,36 @@ ExecAppend(AppendState *node)
 		/*
 		 * get a tuple from the subplan
 		 */
-		result = ExecProcNode(subnode);
+		Assert(!node->as_finished[node->as_whichplan]);
+
+		if (!subnode->result_ready)
+			ExecExecuteNode(subnode);
+
+		if (!subnode->result_ready)
+		{
+			if (node->as_async)
+			{
+				/* Move to the next living node */
+				do
+				{
+					node->as_whichplan =
+						(node->as_whichplan + 1) % node->as_nplans;
+				} while (node->as_whichplan != stopplan &&
+						 node->as_finished[node->as_whichplan]);
+
+				/* No node is ready yet, return as not-ready */
+				if (node->as_whichplan == stopplan)
+					return;
+
+				/* Try the next node */
+				continue;
+			}
+
+			/* If not async, immediately wait for this subnode */
+			ExecAsyncWaitForNode(subnode);
+		}
+
+		result = ExecConsumeResult((PlanState *) subnode);
 
 		if (!TupIsNull(result))
 		{
@@ -220,6 +256,31 @@ ExecAppend(AppendState *node)
 			return;
 		}
 
+		if (node->as_async)
+		{
+			node->as_finished[node->as_whichplan] = true;
+			stopplan = node->as_whichplan;
+
+			/* Find the next living subnode */
+			do
+			{
+				node->as_whichplan =
+					(node->as_whichplan + 1) % node->as_nplans;
+			} while (node->as_whichplan != stopplan &&
+					 node->as_finished[node->as_whichplan]);
+
+			if (node->as_whichplan != stopplan)
+			{
+				stopplan = node->as_whichplan;
+				continue;
+			}
+
+			/* All subnodes are exhausted. Finish this node. */
+			ExecReturnTuple(&node->ps,
+							ExecClearTuple(node->ps.ps_ResultTupleSlot));
+			return;
+		}
+
 		/*
 		 * Go on to the "next" subplan in the appropriate direction. If no
 		 * more subplans, return the empty slot set up for us by
@@ -277,6 +338,8 @@ ExecReScanAppend(AppendState *node)
 	{
 		PlanState  *subnode = node->appendplans[i];
 
+		node->as_finished[i] = false;
+
 		/*
 		 * ExecReScan doesn't know about my subplans, so I have to do
 		 * changed-parameter signaling myself.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index b72decc..b0a86c5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1177,6 +1177,8 @@ typedef struct AppendState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	PlanState **appendplans;	/* array of PlanStates for my inputs */
+	bool		as_async;		/* true to allow async execution */
+	bool	   *as_finished;	/* array of the running state of subplans */
 	int			as_nplans;
 	int			as_whichplan;
 } AppendState;
-- 
2.9.2

0001-Modify-PlanState-to-include-a-pointer-to-the-parent-.patch (text/x-patch; charset=us-ascii)
From f035f2cb0c43100d66f6c5beb5f25ae58b1fb2cd Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Wed, 4 May 2016 12:19:03 -0400
Subject: [PATCH 1/7] Modify PlanState to include a pointer to the parent
 PlanState.
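
The new comment in execnodes.h below describes the field as "node which will
receive tuples from us", so any node can now climb toward the root of its
plan tree.  A hypothetical illustration of such a walk (plan_tree_root is
not part of the patch), relying on the invariant that the root planstate is
initialized with a NULL parent:

    typedef struct PlanState PlanState;

    struct PlanState
    {
        PlanState *parent;   /* NULL at the root, per this patch */
        /* ... other executor fields elided ... */
    };

    /* Walk from any node up to the root of its plan tree. */
    PlanState *
    plan_tree_root(PlanState *ps)
    {
        while (ps->parent != NULL)
            ps = ps->parent;
        return ps;
    }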

---
 src/backend/executor/execMain.c           | 22 ++++++++++++++--------
 src/backend/executor/execProcnode.c       |  5 ++++-
 src/backend/executor/nodeAgg.c            |  3 ++-
 src/backend/executor/nodeAppend.c         |  3 ++-
 src/backend/executor/nodeBitmapAnd.c      |  3 ++-
 src/backend/executor/nodeBitmapHeapscan.c |  3 ++-
 src/backend/executor/nodeBitmapOr.c       |  3 ++-
 src/backend/executor/nodeForeignscan.c    |  2 +-
 src/backend/executor/nodeGather.c         |  3 ++-
 src/backend/executor/nodeGroup.c          |  3 ++-
 src/backend/executor/nodeHash.c           |  3 ++-
 src/backend/executor/nodeHashjoin.c       |  6 ++++--
 src/backend/executor/nodeLimit.c          |  3 ++-
 src/backend/executor/nodeLockRows.c       |  3 ++-
 src/backend/executor/nodeMaterial.c       |  3 ++-
 src/backend/executor/nodeMergeAppend.c    |  3 ++-
 src/backend/executor/nodeMergejoin.c      |  4 +++-
 src/backend/executor/nodeModifyTable.c    |  3 ++-
 src/backend/executor/nodeNestloop.c       |  6 ++++--
 src/backend/executor/nodeRecursiveunion.c |  6 ++++--
 src/backend/executor/nodeResult.c         |  3 ++-
 src/backend/executor/nodeSetOp.c          |  3 ++-
 src/backend/executor/nodeSort.c           |  3 ++-
 src/backend/executor/nodeSubplan.c        |  1 +
 src/backend/executor/nodeSubqueryscan.c   |  3 ++-
 src/backend/executor/nodeUnique.c         |  3 ++-
 src/backend/executor/nodeWindowAgg.c      |  3 ++-
 src/include/executor/executor.h           |  3 ++-
 src/include/nodes/execnodes.h             |  2 ++
 29 files changed, 77 insertions(+), 37 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32bb3f9..ac6d62c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -923,7 +923,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/*
 	 * Initialize private state information for each SubPlan.  We must do this
 	 * before running ExecInitNode on the main query tree, since
-	 * ExecInitSubPlan expects to be able to find these entries.
+	 * ExecInitSubPlan expects to be able to find these entries. Since the
+	 * main plan tree hasn't been initialized yet, we have to pass NULL as the
+	 * parent node to ExecInitNode; ExecInitSubPlan also takes responsibility
+	 * for fixing up subplanstate->parent.
 	 */
 	Assert(estate->es_subplanstates == NIL);
 	i = 1;						/* subplan indices count from 1 */
@@ -943,7 +946,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 		if (bms_is_member(i, plannedstmt->rewindPlanIDs))
 			sp_eflags |= EXEC_FLAG_REWIND;
 
-		subplanstate = ExecInitNode(subplan, estate, sp_eflags);
+		subplanstate = ExecInitNode(subplan, estate, NULL, sp_eflags);
 
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
@@ -954,9 +957,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	/*
 	 * Initialize the private state information for all the nodes in the query
 	 * tree.  This opens files, allocates storage and leaves us ready to start
-	 * processing tuples.
+	 * processing tuples.  This is the root planstate node; it has no parent.
 	 */
-	planstate = ExecInitNode(plan, estate, eflags);
+	planstate = ExecInitNode(plan, estate, NULL, eflags);
 
 	/*
 	 * Get the tuple descriptor describing the type of tuples to return.
@@ -2849,7 +2852,9 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * ExecInitSubPlan expects to be able to find these entries. Some of the
 	 * SubPlans might not be used in the part of the plan tree we intend to
 	 * run, but since it's not easy to tell which, we just initialize them
-	 * all.
+	 * all.  Since the main plan tree hasn't been initialized yet, we have to
+	 * pass NULL as the parent node to ExecInitNode; ExecInitSubPlan also
+	 * takes responsibility for fixing up subplanstate->parent.
 	 */
 	Assert(estate->es_subplanstates == NIL);
 	foreach(l, parentestate->es_plannedstmt->subplans)
@@ -2857,7 +2862,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 		Plan	   *subplan = (Plan *) lfirst(l);
 		PlanState  *subplanstate;
 
-		subplanstate = ExecInitNode(subplan, estate, 0);
+		subplanstate = ExecInitNode(subplan, estate, NULL, 0);
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
 	}
@@ -2865,9 +2870,10 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	/*
 	 * Initialize the private state information for all the nodes in the part
 	 * of the plan tree we need to run.  This opens files, allocates storage
-	 * and leaves us ready to start processing tuples.
+	 * and leaves us ready to start processing tuples.  This is the root plan
+	 * node; it has no parent.
 	 */
-	epqstate->planstate = ExecInitNode(planTree, estate, 0);
+	epqstate->planstate = ExecInitNode(planTree, estate, NULL, 0);
 
 	MemoryContextSwitchTo(oldcontext);
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 554244f..680ca4b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -133,7 +133,7 @@
  * ------------------------------------------------------------------------
  */
 PlanState *
-ExecInitNode(Plan *node, EState *estate, int eflags)
+ExecInitNode(Plan *node, EState *estate, PlanState *parent, int eflags)
 {
 	PlanState  *result;
 	List	   *subps;
@@ -340,6 +340,9 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 			break;
 	}
 
+	/* Set parent pointer. */
+	result->parent = parent;
+
 	/*
 	 * Initialize any initPlans present in this node.  The planner put them in
 	 * a separate list for us.
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ce2fc28..d6aa99c 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2427,7 +2427,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	if (node->aggstrategy == AGG_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
 	outerPlan = outerPlan(node);
-	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(aggstate) =
+		ExecInitNode(outerPlan, estate, &aggstate->ss.ps, eflags);
 
 	/*
 	 * initialize source tuple type.
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..beb4ab8 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -165,7 +165,8 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		appendplanstates[i] = ExecInitNode(initNode, estate, &appendstate->ps,
+										   eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index c39d790..6405fa4 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -81,7 +81,8 @@ ExecInitBitmapAnd(BitmapAnd *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate,
+										   &bitmapandstate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..2ba5cd0 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -646,7 +646,8 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	 * relation's indexes, and we want to be sure we have acquired a lock on
 	 * the relation first.
 	 */
-	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate,
+											 &scanstate->ss.ps, eflags);
 
 	/*
 	 * all done.
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 7e928eb..faa3a37 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -82,7 +82,8 @@ ExecInitBitmapOr(BitmapOr *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate,
+										   &bitmaporstate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..7d9160d 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -224,7 +224,7 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	/* Initialize any outer plan. */
 	if (outerPlan(node))
 		outerPlanState(scanstate) =
-			ExecInitNode(outerPlan(node), estate, eflags);
+			ExecInitNode(outerPlan(node), estate, &scanstate->ss.ps, eflags);
 
 	/*
 	 * Tell the FDW to initialize the scan.
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 438d1b2..1bf5a31 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -97,7 +97,8 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	 * now initialize outer plan
 	 */
 	outerNode = outerPlan(node);
-	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
+	outerPlanState(gatherstate) =
+		ExecInitNode(outerNode, estate, &gatherstate->ps, eflags);
 
 	gatherstate->ps.ps_TupFromTlist = false;
 
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index dcf5175..3c066fc 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -233,7 +233,8 @@ ExecInitGroup(Group *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(grpstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(grpstate) =
+		ExecInitNode(outerPlan(node), estate, &grpstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6375d9b..8333e5c 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -200,7 +200,8 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(hashstate) =
+		ExecInitNode(outerPlan(node), estate, &hashstate->ps, eflags);
 
 	/*
 	 * initialize tuple type. no need to initialize projection info because
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 369e666..a7a908a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -486,8 +486,10 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	outerNode = outerPlan(node);
 	hashNode = (Hash *) innerPlan(node);
 
-	outerPlanState(hjstate) = ExecInitNode(outerNode, estate, eflags);
-	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
+	outerPlanState(hjstate) =
+		ExecInitNode(outerNode, estate, &hjstate->js.ps, eflags);
+	innerPlanState(hjstate) =
+		ExecInitNode((Plan *) hashNode, estate, &hjstate->js.ps, eflags);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index faf32e1..97267c5 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -412,7 +412,8 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	 * then initialize outer plan
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(limitstate) =
+		ExecInitNode(outerPlan, estate, &limitstate->ps, eflags);
 
 	/*
 	 * limit nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 4ebcaff..c4b5333 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -376,7 +376,8 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(lrstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(lrstate) =
+		ExecInitNode(outerPlan, estate, &lrstate->ps, eflags);
 
 	/*
 	 * LockRows nodes do no projections, so initialize projection info for
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9ab03f3..82e31c1 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -219,7 +219,8 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
 	outerPlan = outerPlan(node);
-	outerPlanState(matstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(matstate) =
+		ExecInitNode(outerPlan, estate, &matstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e271927..ae0e8dc 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -112,7 +112,8 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		mergeplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		mergeplanstates[i] =
+			ExecInitNode(initNode, estate, &mergestate->ps, eflags);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 6db09b8..cd8d6c6 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1522,8 +1522,10 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	 *
 	 * inner child must support MARK/RESTORE.
 	 */
-	outerPlanState(mergestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(mergestate) =
+		ExecInitNode(outerPlan(node), estate, &mergestate->js.ps, eflags);
 	innerPlanState(mergestate) = ExecInitNode(innerPlan(node), estate,
+											  &mergestate->js.ps,
 											  eflags | EXEC_FLAG_MARK);
 
 	/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index af7b26c..95cc2c6 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1618,7 +1618,8 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
-		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
+		mtstate->mt_plans[i] =
+			ExecInitNode(subplan, estate, &mtstate->ps, eflags);
 
 		/* Also let FDWs init themselves for foreign-table result rels */
 		if (!resultRelInfo->ri_usesFdwDirectModify &&
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 555fa09..1895b60 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -340,12 +340,14 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 * inner child, because it will always be rescanned with fresh parameter
 	 * values.
 	 */
-	outerPlanState(nlstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(nlstate) =
+		ExecInitNode(outerPlan(node), estate, &nlstate->js.ps, eflags);
 	if (node->nestParams == NIL)
 		eflags |= EXEC_FLAG_REWIND;
 	else
 		eflags &= ~EXEC_FLAG_REWIND;
-	innerPlanState(nlstate) = ExecInitNode(innerPlan(node), estate, eflags);
+	innerPlanState(nlstate) =
+		ExecInitNode(innerPlan(node), estate, &nlstate->js.ps, eflags);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 39be191..627370f 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -241,8 +241,10 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(rustate) = ExecInitNode(outerPlan(node), estate, eflags);
-	innerPlanState(rustate) = ExecInitNode(innerPlan(node), estate, eflags);
+	outerPlanState(rustate) =
+		ExecInitNode(outerPlan(node), estate, &rustate->ps, eflags);
+	innerPlanState(rustate) =
+		ExecInitNode(innerPlan(node), estate, &rustate->ps, eflags);
 
 	/*
 	 * If hashing, precompute fmgr lookup data for inner loop, and create the
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 4007b76..0d2de14 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -250,7 +250,8 @@ ExecInitResult(Result *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(resstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(resstate) =
+		ExecInitNode(outerPlan(node), estate, &resstate->ps, eflags);
 
 	/*
 	 * we don't use inner plan
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 633580b..8b05795 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -533,7 +533,8 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 	 */
 	if (node->strategy == SETOP_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
-	outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(setopstate) =
+		ExecInitNode(outerPlan(node), estate, &setopstate->ps, eflags);
 
 	/*
 	 * setop nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index a34dcc5..0286a7f 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -199,7 +199,8 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	 */
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
-	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(sortstate) =
+		ExecInitNode(outerPlan(node), estate, &sortstate->ss.ps, eflags);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeSubplan.c b/src/backend/executor/nodeSubplan.c
index 2cf169f..db19887 100644
--- a/src/backend/executor/nodeSubplan.c
+++ b/src/backend/executor/nodeSubplan.c
@@ -707,6 +707,7 @@ ExecInitSubPlan(SubPlan *subplan, PlanState *parent)
 
 	/* ... and to its parent's state */
 	sstate->parent = parent;
+	sstate->planstate->parent = parent;
 
 	/* Initialize subexpressions */
 	sstate->testexpr = ExecInitExpr((Expr *) subplan->testexpr, parent);
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 9bafc62..cb007a5 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -136,7 +136,8 @@ ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags)
 	/*
 	 * initialize subquery
 	 */
-	subquerystate->subplan = ExecInitNode(node->subplan, estate, eflags);
+	subquerystate->subplan =
+		ExecInitNode(node->subplan, estate, &subquerystate->ss.ps, eflags);
 
 	subquerystate->ss.ps.ps_TupFromTlist = false;
 
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index f45c792..3b89e84 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -143,7 +143,8 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(uniquestate) =
+		ExecInitNode(outerPlan(node), estate, &uniquestate->ps, eflags);
 
 	/*
 	 * unique nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 371548c..f12fe26 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1840,7 +1840,8 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	 * initialize child nodes
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(winstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(winstate) =
+		ExecInitNode(outerPlan, estate, &winstate->ss.ps, eflags);
 
 	/*
 	 * initialize source tuple type (which is also the tuple type that we'll
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 39521ed..28c0c2e 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -221,7 +221,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
 /*
  * prototypes from functions in execProcnode.c
  */
-extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
+extern PlanState *ExecInitNode(Plan *node, EState *estate, PlanState *parent,
+			 int eflags);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index e7fd7bd..4b18436 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1030,6 +1030,8 @@ typedef struct PlanState
 								 * nodes point to one EState for the whole
 								 * top-level plan */
 
+	struct PlanState *parent;	/* node which will receive tuples from us */
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
-- 
2.9.2
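
Patch 0001 threads a parent pointer through ExecInitNode, so that after
initialization every PlanState records the node that will consume its
tuples; ExecInitSubPlan fills in the same link by hand, since subplan
planstates are not reached through ExecInitNode's recursion.  As far as
these hunks show, nothing consults the link yet.  As a minimal sketch of
what the new field makes possible (ExecFindAncestorOfType is a
hypothetical helper, shown purely for illustration, not part of the
patch), a node can now walk toward the plan root:

static PlanState *
ExecFindAncestorOfType(PlanState *node, NodeTag tag)
{
	PlanState  *cur;

	/* follow the parent links established at ExecInitNode time */
	for (cur = node->parent; cur != NULL; cur = cur->parent)
	{
		if (nodeTag(cur) == tag)
			return cur;			/* e.g. the enclosing GatherState */
	}
	return NULL;				/* reached the plan root without a match */
}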

0002-Modify-PlanState-to-have-result-result_ready-fields..patch (text/x-patch; charset=us-ascii)
From f06450bb9f21603dcceb16d90cff45857bb312df Mon Sep 17 00:00:00 2001
From: Robert Haas <rhaas@postgresql.org>
Date: Fri, 6 May 2016 13:01:48 -0400
Subject: [PATCH 2/7] Modify PlanState to have result/result_ready fields.
 Modify executor to use them instead of returning tuples directly.

---
 src/backend/executor/execProcnode.c       | 75 ++++++++++++++++++-------------
 src/backend/executor/execScan.c           | 26 +++++++----
 src/backend/executor/nodeAgg.c            | 13 +++---
 src/backend/executor/nodeAppend.c         | 11 +++--
 src/backend/executor/nodeBitmapHeapscan.c |  2 +-
 src/backend/executor/nodeCtescan.c        |  2 +-
 src/backend/executor/nodeCustom.c         |  4 +-
 src/backend/executor/nodeForeignscan.c    |  2 +-
 src/backend/executor/nodeFunctionscan.c   |  2 +-
 src/backend/executor/nodeGather.c         | 17 ++++---
 src/backend/executor/nodeGroup.c          | 24 +++++++---
 src/backend/executor/nodeHash.c           |  3 +-
 src/backend/executor/nodeHashjoin.c       | 29 ++++++++----
 src/backend/executor/nodeIndexonlyscan.c  |  2 +-
 src/backend/executor/nodeIndexscan.c      |  2 +-
 src/backend/executor/nodeLimit.c          | 42 ++++++++++++-----
 src/backend/executor/nodeLockRows.c       |  9 ++--
 src/backend/executor/nodeMaterial.c       | 21 ++++++---
 src/backend/executor/nodeMergeAppend.c    |  4 +-
 src/backend/executor/nodeMergejoin.c      | 74 ++++++++++++++++++++++--------
 src/backend/executor/nodeModifyTable.c    | 15 ++++---
 src/backend/executor/nodeNestloop.c       | 16 ++++---
 src/backend/executor/nodeRecursiveunion.c | 10 +++--
 src/backend/executor/nodeResult.c         | 20 ++++++---
 src/backend/executor/nodeSamplescan.c     |  2 +-
 src/backend/executor/nodeSeqscan.c        |  2 +-
 src/backend/executor/nodeSetOp.c          | 14 +++---
 src/backend/executor/nodeSort.c           |  4 +-
 src/backend/executor/nodeSubqueryscan.c   |  2 +-
 src/backend/executor/nodeTidscan.c        |  2 +-
 src/backend/executor/nodeUnique.c         |  8 ++--
 src/backend/executor/nodeValuesscan.c     |  2 +-
 src/backend/executor/nodeWindowAgg.c      | 17 ++++---
 src/backend/executor/nodeWorktablescan.c  |  2 +-
 src/include/executor/executor.h           | 11 ++++-
 src/include/executor/nodeAgg.h            |  2 +-
 src/include/executor/nodeAppend.h         |  2 +-
 src/include/executor/nodeBitmapHeapscan.h |  2 +-
 src/include/executor/nodeCtescan.h        |  2 +-
 src/include/executor/nodeCustom.h         |  2 +-
 src/include/executor/nodeForeignscan.h    |  2 +-
 src/include/executor/nodeFunctionscan.h   |  2 +-
 src/include/executor/nodeGather.h         |  2 +-
 src/include/executor/nodeGroup.h          |  2 +-
 src/include/executor/nodeHash.h           |  2 +-
 src/include/executor/nodeHashjoin.h       |  2 +-
 src/include/executor/nodeIndexonlyscan.h  |  2 +-
 src/include/executor/nodeIndexscan.h      |  2 +-
 src/include/executor/nodeLimit.h          |  2 +-
 src/include/executor/nodeLockRows.h       |  2 +-
 src/include/executor/nodeMaterial.h       |  2 +-
 src/include/executor/nodeMergeAppend.h    |  2 +-
 src/include/executor/nodeMergejoin.h      |  2 +-
 src/include/executor/nodeModifyTable.h    |  2 +-
 src/include/executor/nodeNestloop.h       |  2 +-
 src/include/executor/nodeRecursiveunion.h |  2 +-
 src/include/executor/nodeResult.h         |  2 +-
 src/include/executor/nodeSamplescan.h     |  2 +-
 src/include/executor/nodeSeqscan.h        |  2 +-
 src/include/executor/nodeSetOp.h          |  2 +-
 src/include/executor/nodeSort.h           |  2 +-
 src/include/executor/nodeSubqueryscan.h   |  2 +-
 src/include/executor/nodeTidscan.h        |  2 +-
 src/include/executor/nodeUnique.h         |  2 +-
 src/include/executor/nodeValuesscan.h     |  2 +-
 src/include/executor/nodeWindowAgg.h      |  2 +-
 src/include/executor/nodeWorktablescan.h  |  2 +-
 src/include/nodes/execnodes.h             |  2 +
 68 files changed, 360 insertions(+), 197 deletions(-)
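
The heart of this patch is a change of calling convention: each per-node
Exec* routine becomes void and publishes its output by calling
ExecReturnTuple(), which stashes the slot in node->result and sets
node->result_ready.  ExecProcNode() clears result_ready before
dispatching, asserts afterward that it was set (an asynchronous "not
ready yet" answer is not supported yet), and hands the stashed slot to
its caller exactly as before, so callers are unaffected.  Here is a
minimal sketch of a node written to the new convention (HypotheticalState
and ExecHypothetical are invented names, used only to make the pattern
concrete):

static void
ExecHypothetical(HypotheticalState *node)
{
	TupleTableSlot *slot;

	/* pull one tuple from our child, exactly as before */
	slot = ExecProcNode(outerPlanState(node));

	if (TupIsNull(slot))
	{
		/* end of stream: publish NULL rather than returning it */
		ExecReturnTuple(&node->ps, NULL);
		return;
	}

	/* publish the tuple; ExecProcNode will relay it to our caller */
	ExecReturnTuple(&node->ps, slot);
}

Every code path must set the result exactly once per ExecProcNode()
cycle: the Assert(!node->result_ready) inside ExecReturnTuple() catches a
double store, and the Assert(node->result_ready) in ExecProcNode()
catches a path that forgot to answer.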

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 680ca4b..3f2ebff 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -380,6 +380,9 @@ ExecProcNode(PlanState *node)
 
 	CHECK_FOR_INTERRUPTS();
 
+	/* mark any previous result as having been consumed */
+	node->result_ready = false;
+
 	if (node->chgParam != NULL) /* something changed */
 		ExecReScan(node);		/* let ReScan handle this */
 
@@ -392,23 +395,23 @@ ExecProcNode(PlanState *node)
 			 * control nodes
 			 */
 		case T_ResultState:
-			result = ExecResult((ResultState *) node);
+			ExecResult((ResultState *) node);
 			break;
 
 		case T_ModifyTableState:
-			result = ExecModifyTable((ModifyTableState *) node);
+			ExecModifyTable((ModifyTableState *) node);
 			break;
 
 		case T_AppendState:
-			result = ExecAppend((AppendState *) node);
+			ExecAppend((AppendState *) node);
 			break;
 
 		case T_MergeAppendState:
-			result = ExecMergeAppend((MergeAppendState *) node);
+			ExecMergeAppend((MergeAppendState *) node);
 			break;
 
 		case T_RecursiveUnionState:
-			result = ExecRecursiveUnion((RecursiveUnionState *) node);
+			ExecRecursiveUnion((RecursiveUnionState *) node);
 			break;
 
 			/* BitmapAndState does not yield tuples */
@@ -419,119 +422,119 @@ ExecProcNode(PlanState *node)
 			 * scan nodes
 			 */
 		case T_SeqScanState:
-			result = ExecSeqScan((SeqScanState *) node);
+			ExecSeqScan((SeqScanState *) node);
 			break;
 
 		case T_SampleScanState:
-			result = ExecSampleScan((SampleScanState *) node);
+			ExecSampleScan((SampleScanState *) node);
 			break;
 
 		case T_IndexScanState:
-			result = ExecIndexScan((IndexScanState *) node);
+			ExecIndexScan((IndexScanState *) node);
 			break;
 
 		case T_IndexOnlyScanState:
-			result = ExecIndexOnlyScan((IndexOnlyScanState *) node);
+			ExecIndexOnlyScan((IndexOnlyScanState *) node);
 			break;
 
 			/* BitmapIndexScanState does not yield tuples */
 
 		case T_BitmapHeapScanState:
-			result = ExecBitmapHeapScan((BitmapHeapScanState *) node);
+			ExecBitmapHeapScan((BitmapHeapScanState *) node);
 			break;
 
 		case T_TidScanState:
-			result = ExecTidScan((TidScanState *) node);
+			ExecTidScan((TidScanState *) node);
 			break;
 
 		case T_SubqueryScanState:
-			result = ExecSubqueryScan((SubqueryScanState *) node);
+			ExecSubqueryScan((SubqueryScanState *) node);
 			break;
 
 		case T_FunctionScanState:
-			result = ExecFunctionScan((FunctionScanState *) node);
+			ExecFunctionScan((FunctionScanState *) node);
 			break;
 
 		case T_ValuesScanState:
-			result = ExecValuesScan((ValuesScanState *) node);
+			ExecValuesScan((ValuesScanState *) node);
 			break;
 
 		case T_CteScanState:
-			result = ExecCteScan((CteScanState *) node);
+			ExecCteScan((CteScanState *) node);
 			break;
 
 		case T_WorkTableScanState:
-			result = ExecWorkTableScan((WorkTableScanState *) node);
+			ExecWorkTableScan((WorkTableScanState *) node);
 			break;
 
 		case T_ForeignScanState:
-			result = ExecForeignScan((ForeignScanState *) node);
+			ExecForeignScan((ForeignScanState *) node);
 			break;
 
 		case T_CustomScanState:
-			result = ExecCustomScan((CustomScanState *) node);
+			ExecCustomScan((CustomScanState *) node);
 			break;
 
 			/*
 			 * join nodes
 			 */
 		case T_NestLoopState:
-			result = ExecNestLoop((NestLoopState *) node);
+			ExecNestLoop((NestLoopState *) node);
 			break;
 
 		case T_MergeJoinState:
-			result = ExecMergeJoin((MergeJoinState *) node);
+			ExecMergeJoin((MergeJoinState *) node);
 			break;
 
 		case T_HashJoinState:
-			result = ExecHashJoin((HashJoinState *) node);
+			ExecHashJoin((HashJoinState *) node);
 			break;
 
 			/*
 			 * materialization nodes
 			 */
 		case T_MaterialState:
-			result = ExecMaterial((MaterialState *) node);
+			ExecMaterial((MaterialState *) node);
 			break;
 
 		case T_SortState:
-			result = ExecSort((SortState *) node);
+			ExecSort((SortState *) node);
 			break;
 
 		case T_GroupState:
-			result = ExecGroup((GroupState *) node);
+			ExecGroup((GroupState *) node);
 			break;
 
 		case T_AggState:
-			result = ExecAgg((AggState *) node);
+			ExecAgg((AggState *) node);
 			break;
 
 		case T_WindowAggState:
-			result = ExecWindowAgg((WindowAggState *) node);
+			ExecWindowAgg((WindowAggState *) node);
 			break;
 
 		case T_UniqueState:
-			result = ExecUnique((UniqueState *) node);
+			ExecUnique((UniqueState *) node);
 			break;
 
 		case T_GatherState:
-			result = ExecGather((GatherState *) node);
+			ExecGather((GatherState *) node);
 			break;
 
 		case T_HashState:
-			result = ExecHash((HashState *) node);
+			ExecHash((HashState *) node);
 			break;
 
 		case T_SetOpState:
-			result = ExecSetOp((SetOpState *) node);
+			ExecSetOp((SetOpState *) node);
 			break;
 
 		case T_LockRowsState:
-			result = ExecLockRows((LockRowsState *) node);
+			ExecLockRows((LockRowsState *) node);
 			break;
 
 		case T_LimitState:
-			result = ExecLimit((LimitState *) node);
+			ExecLimit((LimitState *) node);
 			break;
 
 		default:
@@ -540,6 +543,14 @@ ExecProcNode(PlanState *node)
 			break;
 	}
 
+	/* We don't support asynchronous execution yet. */
+	Assert(node->result_ready);
+
+	/* Result should be a TupleTableSlot, unless it's NULL. */
+	Assert(node->result == NULL || IsA(node->result, TupleTableSlot));
+
+	result = (TupleTableSlot *) node->result;
+
 	if (node->instrument)
 		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index fb0013d..095d40b 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -99,7 +99,7 @@ ExecScanFetch(ScanState *node,
  *		ExecScan
  *
  *		Scans the relation using the 'access method' indicated and
- *		returns the next qualifying tuple in the direction specified
+ *		produces the next qualifying tuple in the direction specified
  *		in the global variable ExecDirection.
  *		The access method returns the next tuple and ExecScan() is
  *		responsible for checking the tuple returned against the qual-clause.
@@ -117,7 +117,7 @@ ExecScanFetch(ScanState *node,
  *			 "cursor" is positioned before the first qualifying tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecScan(ScanState *node,
 		 ExecScanAccessMtd accessMtd,	/* function returning a tuple */
 		 ExecScanRecheckMtd recheckMtd)
@@ -137,12 +137,14 @@ ExecScan(ScanState *node,
 
 	/*
 	 * If we have neither a qual to check nor a projection to do, just skip
-	 * all the overhead and return the raw scan tuple.
+	 * all the overhead and produce the raw scan tuple.
 	 */
 	if (!qual && !projInfo)
 	{
 		ResetExprContext(econtext);
-		return ExecScanFetch(node, accessMtd, recheckMtd);
+		ExecReturnTuple(&node->ps,
+						ExecScanFetch(node, accessMtd, recheckMtd));
+		return;
 	}
 
 	/*
@@ -155,7 +157,10 @@ ExecScan(ScanState *node,
 		Assert(projInfo);		/* can't get here if not projecting */
 		resultSlot = ExecProject(projInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -188,9 +193,10 @@ ExecScan(ScanState *node,
 		if (TupIsNull(slot))
 		{
 			if (projInfo)
-				return ExecClearTuple(projInfo->pi_slot);
+				ExecReturnTuple(&node->ps, ExecClearTuple(projInfo->pi_slot));
 			else
-				return slot;
+				ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 
 		/*
@@ -221,7 +227,8 @@ ExecScan(ScanState *node,
 				if (isDone != ExprEndResult)
 				{
 					node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-					return resultSlot;
+					ExecReturnTuple(&node->ps, resultSlot);
+					return;
 				}
 			}
 			else
@@ -229,7 +236,8 @@ ExecScan(ScanState *node,
 				/*
 				 * Here, we aren't projecting, so just return scan tuple.
 				 */
-				return slot;
+				ExecReturnTuple(&node->ps, slot);
+				return;
 			}
 		}
 		else
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index d6aa99c..e17d76c 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -1797,7 +1797,7 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
  *	  stored in the expression context to be used when ExecProject evaluates
  *	  the result tuple.
  */
-TupleTableSlot *
+void
 ExecAgg(AggState *node)
 {
 	TupleTableSlot *result;
@@ -1813,7 +1813,10 @@ ExecAgg(AggState *node)
 
 		result = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ss.ps.ps_TupFromTlist = false;
 	}
@@ -1823,6 +1826,7 @@ ExecAgg(AggState *node)
 	 * agg_done gets set before we emit the final aggregate tuple, and we have
 	 * to finish running SRFs for it.)
 	 */
+	result = NULL;
 	if (!node->agg_done)
 	{
 		/* Dispatch based on strategy */
@@ -1837,12 +1841,9 @@ ExecAgg(AggState *node)
 				result = agg_retrieve_direct(node);
 				break;
 		}
-
-		if (!TupIsNull(result))
-			return result;
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ss.ps, result);
 }
 
 /*
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index beb4ab8..e0ce8c6 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -191,7 +191,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
  *		Handles iteration over multiple subplans.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecAppend(AppendState *node)
 {
 	for (;;)
@@ -216,7 +216,8 @@ ExecAppend(AppendState *node)
 			 * NOT make use of the result slot that was set up in
 			 * ExecInitAppend; there's no need for it.
 			 */
-			return result;
+			ExecReturnTuple(&node->ps, result);
+			return;
 		}
 
 		/*
@@ -229,7 +230,11 @@ ExecAppend(AppendState *node)
 		else
 			node->as_whichplan--;
 		if (!exec_append_initialize_next(node))
-			return ExecClearTuple(node->ps.ps_ResultTupleSlot);
+		{
+			ExecReturnTuple(&node->ps,
+							ExecClearTuple(node->ps.ps_ResultTupleSlot));
+			return;
+		}
 
 		/* Else loop back and try to get a tuple from the new subplan */
 	}
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2ba5cd0..31133ff 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -434,7 +434,7 @@ BitmapHeapRecheck(BitmapHeapScanState *node, TupleTableSlot *slot)
  *		ExecBitmapHeapScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecBitmapHeapScan(BitmapHeapScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 3c2f684..1f1fdf5 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -149,7 +149,7 @@ CteScanRecheck(CteScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecCteScan(CteScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeCustom.c b/src/backend/executor/nodeCustom.c
index 322abca..7162348 100644
--- a/src/backend/executor/nodeCustom.c
+++ b/src/backend/executor/nodeCustom.c
@@ -107,11 +107,11 @@ ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
 	return css;
 }
 
-TupleTableSlot *
+void
 ExecCustomScan(CustomScanState *node)
 {
 	Assert(node->methods->ExecCustomScan != NULL);
-	return node->methods->ExecCustomScan(node);
+	ExecReturnTuple(&node->ss.ps, node->methods->ExecCustomScan(node));
 }
 
 void
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 7d9160d..1f3e072 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -113,7 +113,7 @@ ForeignRecheck(ForeignScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecForeignScan(ForeignScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 5a0f324..5038801 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -262,7 +262,7 @@ FunctionRecheck(FunctionScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecFunctionScan(FunctionScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 1bf5a31..e4cfc44 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -126,7 +126,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
  *		the next qualifying tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecGather(GatherState *node)
 {
 	TupleTableSlot *fslot = node->funnel_slot;
@@ -207,7 +207,10 @@ ExecGather(GatherState *node)
 	{
 		resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -232,7 +235,10 @@ ExecGather(GatherState *node)
 		 */
 		slot = gather_getnext(node);
 		if (TupIsNull(slot))
-			return NULL;
+		{
+			ExecReturnTuple(&node->ps, NULL);
+			return;
+		}
 
 		/*
 		 * form the result tuple using ExecProject(), and return it --- unless
@@ -245,11 +251,12 @@ ExecGather(GatherState *node)
 		if (isDone != ExprEndResult)
 		{
 			node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-			return resultSlot;
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
 		}
 	}
 
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index 3c066fc..f33a316 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -31,7 +31,7 @@
  *
  *		Return one tuple for each group of matching input tuples.
  */
-TupleTableSlot *
+void
 ExecGroup(GroupState *node)
 {
 	ExprContext *econtext;
@@ -44,7 +44,10 @@ ExecGroup(GroupState *node)
 	 * get state info from node
 	 */
 	if (node->grp_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ss.ps, NULL);
+		return;
+	}
 	econtext = node->ss.ps.ps_ExprContext;
 	numCols = ((Group *) node->ss.ps.plan)->numCols;
 	grpColIdx = ((Group *) node->ss.ps.plan)->grpColIdx;
@@ -61,7 +64,10 @@ ExecGroup(GroupState *node)
 
 		result = ExecProject(node->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ss.ps.ps_TupFromTlist = false;
 	}
@@ -87,7 +93,8 @@ ExecGroup(GroupState *node)
 		{
 			/* empty input, so return nothing */
 			node->grp_done = TRUE;
-			return NULL;
+			ExecReturnTuple(&node->ss.ps, NULL);
+			return;
 		}
 		/* Copy tuple into firsttupleslot */
 		ExecCopySlot(firsttupleslot, outerslot);
@@ -115,7 +122,8 @@ ExecGroup(GroupState *node)
 			if (isDone != ExprEndResult)
 			{
 				node->ss.ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-				return result;
+				ExecReturnTuple(&node->ss.ps, result);
+				return;
 			}
 		}
 		else
@@ -139,7 +147,8 @@ ExecGroup(GroupState *node)
 			{
 				/* no more groups, so we're done */
 				node->grp_done = TRUE;
-				return NULL;
+				ExecReturnTuple(&node->ss.ps, NULL);
+				return;
 			}
 
 			/*
@@ -178,7 +187,8 @@ ExecGroup(GroupState *node)
 			if (isDone != ExprEndResult)
 			{
 				node->ss.ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-				return result;
+				ExecReturnTuple(&node->ss.ps, result);
+				return;
 			}
 		}
 		else
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8333e5c..5bc93e0 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -56,11 +56,10 @@ static void *dense_alloc(HashJoinTable hashtable, Size size);
  *		stub for pro forma compliance
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecHash(HashState *node)
 {
 	elog(ERROR, "Hash node does not support ExecProcNode call convention");
-	return NULL;
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index a7a908a..cc92fc3 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -58,7 +58,7 @@ static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
  *			  the other one is "outer".
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecHashJoin(HashJoinState *node)
 {
 	PlanState  *outerNode;
@@ -93,7 +93,10 @@ ExecHashJoin(HashJoinState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -155,7 +158,8 @@ ExecHashJoin(HashJoinState *node)
 					if (TupIsNull(node->hj_FirstOuterTupleSlot))
 					{
 						node->hj_OuterNotEmpty = false;
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 					}
 					else
 						node->hj_OuterNotEmpty = true;
@@ -183,7 +187,10 @@ ExecHashJoin(HashJoinState *node)
 				 * outer relation.
 				 */
 				if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
-					return NULL;
+				{
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
+				}
 
 				/*
 				 * need to remember whether nbatch has increased since we
@@ -323,7 +330,8 @@ ExecHashJoin(HashJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -362,7 +370,8 @@ ExecHashJoin(HashJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -401,7 +410,8 @@ ExecHashJoin(HashJoinState *node)
 					{
 						node->js.ps.ps_TupFromTlist =
 							(isDone == ExprMultipleResult);
-						return result;
+						ExecReturnTuple(&node->js.ps, result);
+						return;
 					}
 				}
 				else
@@ -414,7 +424,10 @@ ExecHashJoin(HashJoinState *node)
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
 				if (!ExecHashJoinNewBatch(node))
-					return NULL;	/* end of join */
+				{
+					ExecReturnTuple(&node->js.ps, NULL); /* end of join */
+					return;
+				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..47285a1 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -249,7 +249,7 @@ IndexOnlyRecheck(IndexOnlyScanState *node, TupleTableSlot *slot)
  *		ExecIndexOnlyScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecIndexOnlyScan(IndexOnlyScanState *node)
 {
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..6bf35d3 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -482,7 +482,7 @@ reorderqueue_pop(IndexScanState *node)
  *		ExecIndexScan(node)
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecIndexScan(IndexScanState *node)
 {
 	/*
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index 97267c5..4e70183 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -36,7 +36,7 @@ static void pass_down_bound(LimitState *node, PlanState *child_node);
  *		filtering on the stream of tuples returned by a subplan.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecLimit(LimitState *node)
 {
 	ScanDirection direction;
@@ -72,7 +72,10 @@ ExecLimit(LimitState *node)
 			 * If backwards scan, just return NULL without changing state.
 			 */
 			if (!ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Check for empty window; if so, treat like empty subplan.
@@ -80,7 +83,8 @@ ExecLimit(LimitState *node)
 			if (node->count <= 0 && !node->noCount)
 			{
 				node->lstate = LIMIT_EMPTY;
-				return NULL;
+				ExecReturnTuple(&node->ps, NULL);
+				return;
 			}
 
 			/*
@@ -96,7 +100,8 @@ ExecLimit(LimitState *node)
 					 * any output at all.
 					 */
 					node->lstate = LIMIT_EMPTY;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 				node->subSlot = slot;
 				if (++node->position > node->offset)
@@ -115,7 +120,8 @@ ExecLimit(LimitState *node)
 			 * The subplan is known to return no tuples (or not more than
 			 * OFFSET tuples, in general).  So we return no tuples.
 			 */
-			return NULL;
+			ExecReturnTuple(&node->ps, NULL);
+			return;
 
 		case LIMIT_INWINDOW:
 			if (ScanDirectionIsForward(direction))
@@ -130,7 +136,8 @@ ExecLimit(LimitState *node)
 					node->position - node->offset >= node->count)
 				{
 					node->lstate = LIMIT_WINDOWEND;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 
 				/*
@@ -140,7 +147,8 @@ ExecLimit(LimitState *node)
 				if (TupIsNull(slot))
 				{
 					node->lstate = LIMIT_SUBPLANEOF;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 				node->subSlot = slot;
 				node->position++;
@@ -154,7 +162,8 @@ ExecLimit(LimitState *node)
 				if (node->position <= node->offset + 1)
 				{
 					node->lstate = LIMIT_WINDOWSTART;
-					return NULL;
+					ExecReturnTuple(&node->ps, NULL);
+					return;
 				}
 
 				/*
@@ -170,7 +179,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_SUBPLANEOF:
 			if (ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Backing up from subplan EOF, so re-fetch previous tuple; there
@@ -186,7 +198,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_WINDOWEND:
 			if (ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Backing up from window end: simply re-return the last tuple
@@ -199,7 +214,10 @@ ExecLimit(LimitState *node)
 
 		case LIMIT_WINDOWSTART:
 			if (!ScanDirectionIsForward(direction))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * Advancing after having backed off window start: simply
@@ -220,7 +238,7 @@ ExecLimit(LimitState *node)
 	/* Return the current tuple */
 	Assert(!TupIsNull(slot));
 
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /*
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index c4b5333..8daa203 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -35,7 +35,7 @@
  *		ExecLockRows
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecLockRows(LockRowsState *node)
 {
 	TupleTableSlot *slot;
@@ -57,7 +57,10 @@ lnext:
 	slot = ExecProcNode(outerPlan);
 
 	if (TupIsNull(slot))
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/* We don't need EvalPlanQual unless we get updated tuple version(s) */
 	epq_needed = false;
@@ -334,7 +337,7 @@ lnext:
 	}
 
 	/* Got all locks, so return the current tuple */
-	return slot;
+	ExecReturnTuple(&node->ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 82e31c1..fd3b013 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -35,7 +35,7 @@
  *
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* result tuple from subplan */
+void
 ExecMaterial(MaterialState *node)
 {
 	EState	   *estate;
@@ -93,7 +93,11 @@ ExecMaterial(MaterialState *node)
 			 * fetch.
 			 */
 			if (!tuplestore_advance(tuplestorestate, forward))
-				return NULL;	/* the tuplestore must be empty */
+			{
+				/* the tuplestore must be empty */
+				ExecReturnTuple(&node->ss.ps, NULL);
+				return;
+			}
 		}
 		eof_tuplestore = false;
 	}
@@ -105,7 +109,10 @@ ExecMaterial(MaterialState *node)
 	if (!eof_tuplestore)
 	{
 		if (tuplestore_gettupleslot(tuplestorestate, forward, false, slot))
-			return slot;
+		{
+			ExecReturnTuple(&node->ss.ps, slot);
+			return;
+		}
 		if (forward)
 			eof_tuplestore = true;
 	}
@@ -132,7 +139,8 @@ ExecMaterial(MaterialState *node)
 		if (TupIsNull(outerslot))
 		{
 			node->eof_underlying = true;
-			return NULL;
+			ExecReturnTuple(&node->ss.ps, NULL);
+			return;
 		}
 
 		/*
@@ -146,13 +154,14 @@ ExecMaterial(MaterialState *node)
 		/*
 		 * We can just return the subplan's returned tuple, without copying.
 		 */
-		return outerslot;
+		ExecReturnTuple(&node->ss.ps, outerslot);
+		return;
 	}
 
 	/*
 	 * Nothing left ...
 	 */
-	return ExecClearTuple(slot);
+	ExecReturnTuple(&node->ss.ps, ExecClearTuple(slot));
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index ae0e8dc..3ef8120 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -164,7 +164,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
  *		Handles iteration over multiple subplans.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecMergeAppend(MergeAppendState *node)
 {
 	TupleTableSlot *result;
@@ -214,7 +214,7 @@ ExecMergeAppend(MergeAppendState *node)
 		result = node->ms_slots[i];
 	}
 
-	return result;
+	ExecReturnTuple(&node->ps, result);
 }
 
 /*
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index cd8d6c6..d73d9f4 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -615,7 +615,7 @@ ExecMergeTupleDump(MergeJoinState *mergestate)
  *		ExecMergeJoin
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecMergeJoin(MergeJoinState *node)
 {
 	List	   *joinqual;
@@ -653,7 +653,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -710,7 +713,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 							result = MJFillOuter(node);
 							if (result)
-								return result;
+							{
+								ExecReturnTuple(&node->js.ps, result);
+								return;
+							}
 						}
 						break;
 					case MJEVAL_ENDOFJOIN:
@@ -728,7 +734,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -765,7 +772,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 							result = MJFillInner(node);
 							if (result)
-								return result;
+							{
+								ExecReturnTuple(&node->js.ps, result);
+								return;
+							}
 						}
 						break;
 					case MJEVAL_ENDOFJOIN:
@@ -785,7 +795,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -868,7 +879,8 @@ ExecMergeJoin(MergeJoinState *node)
 						{
 							node->js.ps.ps_TupFromTlist =
 								(isDone == ExprMultipleResult);
-							return result;
+							ExecReturnTuple(&node->js.ps, result);
+							return;
 						}
 					}
 					else
@@ -901,7 +913,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1003,7 +1018,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1039,7 +1057,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1174,7 +1193,8 @@ ExecMergeJoin(MergeJoinState *node)
 								break;
 							}
 							/* Otherwise we're done. */
-							return NULL;
+							ExecReturnTuple(&node->js.ps, NULL);
+							return;
 					}
 				}
 				break;
@@ -1256,7 +1276,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1292,7 +1315,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1318,7 +1342,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/* Mark before advancing, if wanted */
@@ -1362,7 +1389,8 @@ ExecMergeJoin(MergeJoinState *node)
 							break;
 						}
 						/* Otherwise we're done. */
-						return NULL;
+						ExecReturnTuple(&node->js.ps, NULL);
+						return;
 				}
 				break;
 
@@ -1388,7 +1416,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillInner(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/* Mark before advancing, if wanted */
@@ -1406,7 +1437,8 @@ ExecMergeJoin(MergeJoinState *node)
 				if (TupIsNull(innerTupleSlot))
 				{
 					MJ_printf("ExecMergeJoin: end of inner subplan\n");
-					return NULL;
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
 				}
 
 				/* Else remain in ENDOUTER state and process next tuple. */
@@ -1434,7 +1466,10 @@ ExecMergeJoin(MergeJoinState *node)
 
 					result = MJFillOuter(node);
 					if (result)
-						return result;
+					{
+						ExecReturnTuple(&node->js.ps, result);
+						return;
+					}
 				}
 
 				/*
@@ -1448,7 +1483,8 @@ ExecMergeJoin(MergeJoinState *node)
 				if (TupIsNull(outerTupleSlot))
 				{
 					MJ_printf("ExecMergeJoin: end of outer subplan\n");
-					return NULL;
+					ExecReturnTuple(&node->js.ps, NULL);
+					return;
 				}
 
 				/* Else remain in ENDINNER state and process next tuple. */
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95cc2c6..0e05d4d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1298,7 +1298,7 @@ fireASTriggers(ModifyTableState *node)
  *		if needed.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecModifyTable(ModifyTableState *node)
 {
 	EState	   *estate = node->ps.state;
@@ -1333,7 +1333,10 @@ ExecModifyTable(ModifyTableState *node)
 	 * extra times.
 	 */
 	if (node->mt_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/*
 	 * On first call, fire BEFORE STATEMENT triggers before proceeding.
@@ -1411,7 +1414,8 @@ ExecModifyTable(ModifyTableState *node)
 			slot = ExecProcessReturning(resultRelInfo, NULL, planSlot);
 
 			estate->es_result_relation_info = saved_resultRelInfo;
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 
 		EvalPlanQualSetSlot(&node->mt_epqstate, planSlot);
@@ -1517,7 +1521,8 @@ ExecModifyTable(ModifyTableState *node)
 		if (slot)
 		{
 			estate->es_result_relation_info = saved_resultRelInfo;
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 	}
 
@@ -1531,7 +1536,7 @@ ExecModifyTable(ModifyTableState *node)
 
 	node->mt_done = true;
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 1895b60..54eff56 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -56,7 +56,7 @@
  *			   are prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecNestLoop(NestLoopState *node)
 {
 	NestLoop   *nl;
@@ -93,7 +93,10 @@ ExecNestLoop(NestLoopState *node)
 
 		result = ExecProject(node->js.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&node->js.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->js.ps.ps_TupFromTlist = false;
 	}
@@ -128,7 +131,8 @@ ExecNestLoop(NestLoopState *node)
 			if (TupIsNull(outerTupleSlot))
 			{
 				ENL1_printf("no outer tuple, ending join");
-				return NULL;
+				ExecReturnTuple(&node->js.ps, NULL);
+				return;
 			}
 
 			ENL1_printf("saving new outer tuple information");
@@ -212,7 +216,8 @@ ExecNestLoop(NestLoopState *node)
 					{
 						node->js.ps.ps_TupFromTlist =
 							(isDone == ExprMultipleResult);
-						return result;
+						ExecReturnTuple(&node->js.ps, result);
+						return;
 					}
 				}
 				else
@@ -270,7 +275,8 @@ ExecNestLoop(NestLoopState *node)
 				{
 					node->js.ps.ps_TupFromTlist =
 						(isDone == ExprMultipleResult);
-					return result;
+					ExecReturnTuple(&node->js.ps, result);
+					return;
 				}
 			}
 			else
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index 627370f..8a29df8 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -72,7 +72,7 @@ build_hash_table(RecursiveUnionState *rustate)
  * 2.6 go back to 2.2
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecRecursiveUnion(RecursiveUnionState *node)
 {
 	PlanState  *outerPlan = outerPlanState(node);
@@ -102,7 +102,8 @@ ExecRecursiveUnion(RecursiveUnionState *node)
 			/* Each non-duplicate tuple goes to the working table ... */
 			tuplestore_puttupleslot(node->working_table, slot);
 			/* ... and to the caller */
-			return slot;
+			ExecReturnTuple(&node->ps, slot);
+			return;
 		}
 		node->recursing = true;
 	}
@@ -151,10 +152,11 @@ ExecRecursiveUnion(RecursiveUnionState *node)
 		node->intermediate_empty = false;
 		tuplestore_puttupleslot(node->intermediate_table, slot);
 		/* ... and return it */
-		return slot;
+		ExecReturnTuple(&node->ps, slot);
+		return;
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 0d2de14..a830ffd 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -63,7 +63,7 @@
  *		'nil' if the constant qualification is not satisfied.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecResult(ResultState *node)
 {
 	TupleTableSlot *outerTupleSlot;
@@ -87,7 +87,8 @@ ExecResult(ResultState *node)
 		if (!qualResult)
 		{
 			node->rs_done = true;
-			return NULL;
+			ExecReturnTuple(&node->ps, NULL);
+			return;
 		}
 	}
 
@@ -100,7 +101,10 @@ ExecResult(ResultState *node)
 	{
 		resultSlot = ExecProject(node->ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return resultSlot;
+		{
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
+		}
 		/* Done with that source tuple... */
 		node->ps.ps_TupFromTlist = false;
 	}
@@ -130,7 +134,10 @@ ExecResult(ResultState *node)
 			outerTupleSlot = ExecProcNode(outerPlan);
 
 			if (TupIsNull(outerTupleSlot))
-				return NULL;
+			{
+				ExecReturnTuple(&node->ps, NULL);
+				return;
+			}
 
 			/*
 			 * prepare to compute projection expressions, which will expect to
@@ -157,11 +164,12 @@ ExecResult(ResultState *node)
 		if (isDone != ExprEndResult)
 		{
 			node->ps.ps_TupFromTlist = (isDone == ExprMultipleResult);
-			return resultSlot;
+			ExecReturnTuple(&node->ps, resultSlot);
+			return;
 		}
 	}
 
-	return NULL;
+	ExecReturnTuple(&node->ps, NULL);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 9ce7c02..89cce0e 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -95,7 +95,7 @@ SampleRecheck(SampleScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSampleScan(SampleScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 00bf3a5..0ca86d9 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -121,7 +121,7 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSeqScan(SeqScanState *node)
 {
 	return ExecScan((ScanState *) node,
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 8b05795..fe12631 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -191,7 +191,7 @@ set_output_count(SetOpState *setopstate, SetOpStatePerGroup pergroup)
  *		ExecSetOp
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecSetOp(SetOpState *node)
 {
 	SetOp	   *plannode = (SetOp *) node->ps.plan;
@@ -204,22 +204,26 @@ ExecSetOp(SetOpState *node)
 	if (node->numOutput > 0)
 	{
 		node->numOutput--;
-		return resultTupleSlot;
+		ExecReturnTuple(&node->ps, resultTupleSlot);
+		return;
 	}
 
 	/* Otherwise, we're done if we are out of groups */
 	if (node->setop_done)
-		return NULL;
+	{
+		ExecReturnTuple(&node->ps, NULL);
+		return;
+	}
 
 	/* Fetch the next tuple group according to the correct strategy */
 	if (plannode->strategy == SETOP_HASHED)
 	{
 		if (!node->table_filled)
 			setop_fill_hash_table(node);
-		return setop_retrieve_hash_table(node);
+		ExecReturnTuple(&node->ps, setop_retrieve_hash_table(node));
 	}
 	else
-		return setop_retrieve_direct(node);
+		ExecReturnTuple(&node->ps, setop_retrieve_direct(node));
 }
 
 /*
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 0286a7f..13f721a 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -35,7 +35,7 @@
  *		  -- the outer child is prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSort(SortState *node)
 {
 	EState	   *estate;
@@ -138,7 +138,7 @@ ExecSort(SortState *node)
 	(void) tuplesort_gettupleslot(tuplesortstate,
 								  ScanDirectionIsForward(dir),
 								  slot, NULL);
-	return slot;
+	ExecReturnTuple(&node->ss.ps, slot);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index cb007a5..0562926 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -79,7 +79,7 @@ SubqueryRecheck(SubqueryScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecSubqueryScan(SubqueryScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index 2604103..e2a0479 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -387,7 +387,7 @@ TidRecheck(TidScanState *node, TupleTableSlot *slot)
  *		  -- tidPtr is -1.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecTidScan(TidScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 3b89e84..ac89323 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -42,7 +42,7 @@
  *		ExecUnique
  * ----------------------------------------------------------------
  */
-TupleTableSlot *				/* return: a tuple or NULL */
+void
 ExecUnique(UniqueState *node)
 {
 	Unique	   *plannode = (Unique *) node->ps.plan;
@@ -70,8 +70,8 @@ ExecUnique(UniqueState *node)
 		if (TupIsNull(slot))
 		{
 			/* end of subplan, so we're done */
-			ExecClearTuple(resultTupleSlot);
-			return NULL;
+			ExecReturnTuple(&node->ps, ExecClearTuple(resultTupleSlot));
+			return;
 		}
 
 		/*
@@ -98,7 +98,7 @@ ExecUnique(UniqueState *node)
 	 * won't guarantee that this source tuple is still accessible after
 	 * fetching the next source tuple.
 	 */
-	return ExecCopySlot(resultTupleSlot, slot);
+	ExecReturnTuple(&node->ps, ExecCopySlot(resultTupleSlot, slot));
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/executor/nodeValuesscan.c b/src/backend/executor/nodeValuesscan.c
index 9c03f8a..3e6c321 100644
--- a/src/backend/executor/nodeValuesscan.c
+++ b/src/backend/executor/nodeValuesscan.c
@@ -186,7 +186,7 @@ ValuesRecheck(ValuesScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecValuesScan(ValuesScanState *node)
 {
 	return ExecScan(&node->ss,
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index f12fe26..16c02f8 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1555,7 +1555,7 @@ update_frametailpos(WindowObject winobj, TupleTableSlot *slot)
  *	(ignoring the case of SRFs in the targetlist, that is).
  * -----------------
  */
-TupleTableSlot *
+void
 ExecWindowAgg(WindowAggState *winstate)
 {
 	TupleTableSlot *result;
@@ -1565,7 +1565,10 @@ ExecWindowAgg(WindowAggState *winstate)
 	int			numfuncs;
 
 	if (winstate->all_done)
-		return NULL;
+	{
+		ExecReturnTuple(&winstate->ss.ps, NULL);
+		return;
+	}
 
 	/*
 	 * Check to see if we're still projecting out tuples from a previous
@@ -1579,7 +1582,10 @@ ExecWindowAgg(WindowAggState *winstate)
 
 		result = ExecProject(winstate->ss.ps.ps_ProjInfo, &isDone);
 		if (isDone == ExprMultipleResult)
-			return result;
+		{
+			ExecReturnTuple(&winstate->ss.ps, result);
+			return;
+		}
 		/* Done with that source tuple... */
 		winstate->ss.ps.ps_TupFromTlist = false;
 	}
@@ -1687,7 +1693,8 @@ restart:
 		else
 		{
 			winstate->all_done = true;
-			return NULL;
+			ExecReturnTuple(&winstate->ss.ps, NULL);
+			return;
 		}
 	}
 
@@ -1753,7 +1760,7 @@ restart:
 
 	winstate->ss.ps.ps_TupFromTlist =
 		(isDone == ExprMultipleResult);
-	return result;
+	ExecReturnTuple(&winstate->ss.ps, result);
 }
 
 /* -----------------
diff --git a/src/backend/executor/nodeWorktablescan.c b/src/backend/executor/nodeWorktablescan.c
index cfed6e6..c3615b2 100644
--- a/src/backend/executor/nodeWorktablescan.c
+++ b/src/backend/executor/nodeWorktablescan.c
@@ -77,7 +77,7 @@ WorkTableScanRecheck(WorkTableScanState *node, TupleTableSlot *slot)
  *		access method functions.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
+void
 ExecWorkTableScan(WorkTableScanState *node)
 {
 	/*
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 28c0c2e..1eb09d8 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -228,6 +228,15 @@ extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
 
+/* Convenience function to set a node's result to a TupleTableSlot. */
+static inline void
+ExecReturnTuple(PlanState *node, TupleTableSlot *slot)
+{
+	Assert(!node->result_ready);
+	node->result = (Node *) slot;
+	node->result_ready = true;
+}
+
 /*
  * prototypes from functions in execQual.c
  */
@@ -256,7 +265,7 @@ extern TupleTableSlot *ExecProject(ProjectionInfo *projInfo,
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
 
-extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
+extern void ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 		 ExecScanRecheckMtd recheckMtd);
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, Index varno);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 54c75e8..b86ec6a 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAgg(AggState *node);
+extern void ExecAgg(AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
diff --git a/src/include/executor/nodeAppend.h b/src/include/executor/nodeAppend.h
index 51c381e..70a6b62 100644
--- a/src/include/executor/nodeAppend.h
+++ b/src/include/executor/nodeAppend.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern AppendState *ExecInitAppend(Append *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAppend(AppendState *node);
+extern void ExecAppend(AppendState *node);
 extern void ExecEndAppend(AppendState *node);
 extern void ExecReScanAppend(AppendState *node);
 
diff --git a/src/include/executor/nodeBitmapHeapscan.h b/src/include/executor/nodeBitmapHeapscan.h
index 0ed9c78..069dbc7 100644
--- a/src/include/executor/nodeBitmapHeapscan.h
+++ b/src/include/executor/nodeBitmapHeapscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern BitmapHeapScanState *ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecBitmapHeapScan(BitmapHeapScanState *node);
+extern void ExecBitmapHeapScan(BitmapHeapScanState *node);
 extern void ExecEndBitmapHeapScan(BitmapHeapScanState *node);
 extern void ExecReScanBitmapHeapScan(BitmapHeapScanState *node);
 
diff --git a/src/include/executor/nodeCtescan.h b/src/include/executor/nodeCtescan.h
index ef5c2bc..8411fa1 100644
--- a/src/include/executor/nodeCtescan.h
+++ b/src/include/executor/nodeCtescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern CteScanState *ExecInitCteScan(CteScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecCteScan(CteScanState *node);
+extern void ExecCteScan(CteScanState *node);
 extern void ExecEndCteScan(CteScanState *node);
 extern void ExecReScanCteScan(CteScanState *node);
 
diff --git a/src/include/executor/nodeCustom.h b/src/include/executor/nodeCustom.h
index 7d16c2b..5df2ebb 100644
--- a/src/include/executor/nodeCustom.h
+++ b/src/include/executor/nodeCustom.h
@@ -21,7 +21,7 @@
  */
 extern CustomScanState *ExecInitCustomScan(CustomScan *custom_scan,
 				   EState *estate, int eflags);
-extern TupleTableSlot *ExecCustomScan(CustomScanState *node);
+extern void ExecCustomScan(CustomScanState *node);
 extern void ExecEndCustomScan(CustomScanState *node);
 
 extern void ExecReScanCustomScan(CustomScanState *node);
diff --git a/src/include/executor/nodeForeignscan.h b/src/include/executor/nodeForeignscan.h
index 0cdec4e..3d0f7bd 100644
--- a/src/include/executor/nodeForeignscan.h
+++ b/src/include/executor/nodeForeignscan.h
@@ -18,7 +18,7 @@
 #include "nodes/execnodes.h"
 
 extern ForeignScanState *ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecForeignScan(ForeignScanState *node);
+extern void ExecForeignScan(ForeignScanState *node);
 extern void ExecEndForeignScan(ForeignScanState *node);
 extern void ExecReScanForeignScan(ForeignScanState *node);
 
diff --git a/src/include/executor/nodeFunctionscan.h b/src/include/executor/nodeFunctionscan.h
index d6e7a61..15beb13 100644
--- a/src/include/executor/nodeFunctionscan.h
+++ b/src/include/executor/nodeFunctionscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern FunctionScanState *ExecInitFunctionScan(FunctionScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecFunctionScan(FunctionScanState *node);
+extern void ExecFunctionScan(FunctionScanState *node);
 extern void ExecEndFunctionScan(FunctionScanState *node);
 extern void ExecReScanFunctionScan(FunctionScanState *node);
 
diff --git a/src/include/executor/nodeGather.h b/src/include/executor/nodeGather.h
index f76d9be..100a827 100644
--- a/src/include/executor/nodeGather.h
+++ b/src/include/executor/nodeGather.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern GatherState *ExecInitGather(Gather *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecGather(GatherState *node);
+extern void ExecGather(GatherState *node);
 extern void ExecEndGather(GatherState *node);
 extern void ExecShutdownGather(GatherState *node);
 extern void ExecReScanGather(GatherState *node);
diff --git a/src/include/executor/nodeGroup.h b/src/include/executor/nodeGroup.h
index 92639f5..446ded5 100644
--- a/src/include/executor/nodeGroup.h
+++ b/src/include/executor/nodeGroup.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern GroupState *ExecInitGroup(Group *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecGroup(GroupState *node);
+extern void ExecGroup(GroupState *node);
 extern void ExecEndGroup(GroupState *node);
 extern void ExecReScanGroup(GroupState *node);
 
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index 8cf6d15..b395fd9 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern HashState *ExecInitHash(Hash *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHash(HashState *node);
+extern void ExecHash(HashState *node);
 extern Node *MultiExecHash(HashState *node);
 extern void ExecEndHash(HashState *node);
 extern void ExecReScanHash(HashState *node);
diff --git a/src/include/executor/nodeHashjoin.h b/src/include/executor/nodeHashjoin.h
index f24127a..072c610 100644
--- a/src/include/executor/nodeHashjoin.h
+++ b/src/include/executor/nodeHashjoin.h
@@ -18,7 +18,7 @@
 #include "storage/buffile.h"
 
 extern HashJoinState *ExecInitHashJoin(HashJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHashJoin(HashJoinState *node);
+extern void ExecHashJoin(HashJoinState *node);
 extern void ExecEndHashJoin(HashJoinState *node);
 extern void ExecReScanHashJoin(HashJoinState *node);
 
diff --git a/src/include/executor/nodeIndexonlyscan.h b/src/include/executor/nodeIndexonlyscan.h
index d63d194..0fbcf80 100644
--- a/src/include/executor/nodeIndexonlyscan.h
+++ b/src/include/executor/nodeIndexonlyscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern IndexOnlyScanState *ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecIndexOnlyScan(IndexOnlyScanState *node);
+extern void ExecIndexOnlyScan(IndexOnlyScanState *node);
 extern void ExecEndIndexOnlyScan(IndexOnlyScanState *node);
 extern void ExecIndexOnlyMarkPos(IndexOnlyScanState *node);
 extern void ExecIndexOnlyRestrPos(IndexOnlyScanState *node);
diff --git a/src/include/executor/nodeIndexscan.h b/src/include/executor/nodeIndexscan.h
index 194fadb..341dab3 100644
--- a/src/include/executor/nodeIndexscan.h
+++ b/src/include/executor/nodeIndexscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern IndexScanState *ExecInitIndexScan(IndexScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecIndexScan(IndexScanState *node);
+extern void ExecIndexScan(IndexScanState *node);
 extern void ExecEndIndexScan(IndexScanState *node);
 extern void ExecIndexMarkPos(IndexScanState *node);
 extern void ExecIndexRestrPos(IndexScanState *node);
diff --git a/src/include/executor/nodeLimit.h b/src/include/executor/nodeLimit.h
index 96166b4..03dde30 100644
--- a/src/include/executor/nodeLimit.h
+++ b/src/include/executor/nodeLimit.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern LimitState *ExecInitLimit(Limit *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLimit(LimitState *node);
+extern void ExecLimit(LimitState *node);
 extern void ExecEndLimit(LimitState *node);
 extern void ExecReScanLimit(LimitState *node);
 
diff --git a/src/include/executor/nodeLockRows.h b/src/include/executor/nodeLockRows.h
index e828e9c..eda3cbec 100644
--- a/src/include/executor/nodeLockRows.h
+++ b/src/include/executor/nodeLockRows.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern LockRowsState *ExecInitLockRows(LockRows *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLockRows(LockRowsState *node);
+extern void ExecLockRows(LockRowsState *node);
 extern void ExecEndLockRows(LockRowsState *node);
 extern void ExecReScanLockRows(LockRowsState *node);
 
diff --git a/src/include/executor/nodeMaterial.h b/src/include/executor/nodeMaterial.h
index 2b8cae1..20bc7f6 100644
--- a/src/include/executor/nodeMaterial.h
+++ b/src/include/executor/nodeMaterial.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MaterialState *ExecInitMaterial(Material *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMaterial(MaterialState *node);
+extern void ExecMaterial(MaterialState *node);
 extern void ExecEndMaterial(MaterialState *node);
 extern void ExecMaterialMarkPos(MaterialState *node);
 extern void ExecMaterialRestrPos(MaterialState *node);
diff --git a/src/include/executor/nodeMergeAppend.h b/src/include/executor/nodeMergeAppend.h
index 0efc489..e43b5e6 100644
--- a/src/include/executor/nodeMergeAppend.h
+++ b/src/include/executor/nodeMergeAppend.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MergeAppendState *ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMergeAppend(MergeAppendState *node);
+extern void ExecMergeAppend(MergeAppendState *node);
 extern void ExecEndMergeAppend(MergeAppendState *node);
 extern void ExecReScanMergeAppend(MergeAppendState *node);
 
diff --git a/src/include/executor/nodeMergejoin.h b/src/include/executor/nodeMergejoin.h
index 74d691c..dfdbc1b 100644
--- a/src/include/executor/nodeMergejoin.h
+++ b/src/include/executor/nodeMergejoin.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern MergeJoinState *ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecMergeJoin(MergeJoinState *node);
+extern void ExecMergeJoin(MergeJoinState *node);
 extern void ExecEndMergeJoin(MergeJoinState *node);
 extern void ExecReScanMergeJoin(MergeJoinState *node);
 
diff --git a/src/include/executor/nodeModifyTable.h b/src/include/executor/nodeModifyTable.h
index 6b66353..fe67248 100644
--- a/src/include/executor/nodeModifyTable.h
+++ b/src/include/executor/nodeModifyTable.h
@@ -16,7 +16,7 @@
 #include "nodes/execnodes.h"
 
 extern ModifyTableState *ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecModifyTable(ModifyTableState *node);
+extern void ExecModifyTable(ModifyTableState *node);
 extern void ExecEndModifyTable(ModifyTableState *node);
 extern void ExecReScanModifyTable(ModifyTableState *node);
 
diff --git a/src/include/executor/nodeNestloop.h b/src/include/executor/nodeNestloop.h
index eeb42d6..cab1885 100644
--- a/src/include/executor/nodeNestloop.h
+++ b/src/include/executor/nodeNestloop.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern NestLoopState *ExecInitNestLoop(NestLoop *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecNestLoop(NestLoopState *node);
+extern void ExecNestLoop(NestLoopState *node);
 extern void ExecEndNestLoop(NestLoopState *node);
 extern void ExecReScanNestLoop(NestLoopState *node);
 
diff --git a/src/include/executor/nodeRecursiveunion.h b/src/include/executor/nodeRecursiveunion.h
index 1c08790..fb11eca 100644
--- a/src/include/executor/nodeRecursiveunion.h
+++ b/src/include/executor/nodeRecursiveunion.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern RecursiveUnionState *ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecRecursiveUnion(RecursiveUnionState *node);
+extern void ExecRecursiveUnion(RecursiveUnionState *node);
 extern void ExecEndRecursiveUnion(RecursiveUnionState *node);
 extern void ExecReScanRecursiveUnion(RecursiveUnionState *node);
 
diff --git a/src/include/executor/nodeResult.h b/src/include/executor/nodeResult.h
index 356027f..951fae6 100644
--- a/src/include/executor/nodeResult.h
+++ b/src/include/executor/nodeResult.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern ResultState *ExecInitResult(Result *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecResult(ResultState *node);
+extern void ExecResult(ResultState *node);
 extern void ExecEndResult(ResultState *node);
 extern void ExecResultMarkPos(ResultState *node);
 extern void ExecResultRestrPos(ResultState *node);
diff --git a/src/include/executor/nodeSamplescan.h b/src/include/executor/nodeSamplescan.h
index c8f03d8..4ab6e5a 100644
--- a/src/include/executor/nodeSamplescan.h
+++ b/src/include/executor/nodeSamplescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SampleScanState *ExecInitSampleScan(SampleScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSampleScan(SampleScanState *node);
+extern void ExecSampleScan(SampleScanState *node);
 extern void ExecEndSampleScan(SampleScanState *node);
 extern void ExecReScanSampleScan(SampleScanState *node);
 
diff --git a/src/include/executor/nodeSeqscan.h b/src/include/executor/nodeSeqscan.h
index f2e61ff..816d1a5 100644
--- a/src/include/executor/nodeSeqscan.h
+++ b/src/include/executor/nodeSeqscan.h
@@ -18,7 +18,7 @@
 #include "nodes/execnodes.h"
 
 extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSeqScan(SeqScanState *node);
+extern void ExecSeqScan(SeqScanState *node);
 extern void ExecEndSeqScan(SeqScanState *node);
 extern void ExecReScanSeqScan(SeqScanState *node);
 
diff --git a/src/include/executor/nodeSetOp.h b/src/include/executor/nodeSetOp.h
index c6e9603..dd88afb 100644
--- a/src/include/executor/nodeSetOp.h
+++ b/src/include/executor/nodeSetOp.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SetOpState *ExecInitSetOp(SetOp *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSetOp(SetOpState *node);
+extern void ExecSetOp(SetOpState *node);
 extern void ExecEndSetOp(SetOpState *node);
 extern void ExecReScanSetOp(SetOpState *node);
 
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index 481065f..f65037d 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSort(SortState *node);
+extern void ExecSort(SortState *node);
 extern void ExecEndSort(SortState *node);
 extern void ExecSortMarkPos(SortState *node);
 extern void ExecSortRestrPos(SortState *node);
diff --git a/src/include/executor/nodeSubqueryscan.h b/src/include/executor/nodeSubqueryscan.h
index 427699b..a3962c7 100644
--- a/src/include/executor/nodeSubqueryscan.h
+++ b/src/include/executor/nodeSubqueryscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern SubqueryScanState *ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSubqueryScan(SubqueryScanState *node);
+extern void ExecSubqueryScan(SubqueryScanState *node);
 extern void ExecEndSubqueryScan(SubqueryScanState *node);
 extern void ExecReScanSubqueryScan(SubqueryScanState *node);
 
diff --git a/src/include/executor/nodeTidscan.h b/src/include/executor/nodeTidscan.h
index 76c2a9f..5b7bbfd 100644
--- a/src/include/executor/nodeTidscan.h
+++ b/src/include/executor/nodeTidscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern TidScanState *ExecInitTidScan(TidScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecTidScan(TidScanState *node);
+extern void ExecTidScan(TidScanState *node);
 extern void ExecEndTidScan(TidScanState *node);
 extern void ExecReScanTidScan(TidScanState *node);
 
diff --git a/src/include/executor/nodeUnique.h b/src/include/executor/nodeUnique.h
index aa8491d..b53a553 100644
--- a/src/include/executor/nodeUnique.h
+++ b/src/include/executor/nodeUnique.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern UniqueState *ExecInitUnique(Unique *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecUnique(UniqueState *node);
+extern void ExecUnique(UniqueState *node);
 extern void ExecEndUnique(UniqueState *node);
 extern void ExecReScanUnique(UniqueState *node);
 
diff --git a/src/include/executor/nodeValuesscan.h b/src/include/executor/nodeValuesscan.h
index 026f261..90288fc 100644
--- a/src/include/executor/nodeValuesscan.h
+++ b/src/include/executor/nodeValuesscan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern ValuesScanState *ExecInitValuesScan(ValuesScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecValuesScan(ValuesScanState *node);
+extern void ExecValuesScan(ValuesScanState *node);
 extern void ExecEndValuesScan(ValuesScanState *node);
 extern void ExecReScanValuesScan(ValuesScanState *node);
 
diff --git a/src/include/executor/nodeWindowAgg.h b/src/include/executor/nodeWindowAgg.h
index 94ed037..f5e2c98 100644
--- a/src/include/executor/nodeWindowAgg.h
+++ b/src/include/executor/nodeWindowAgg.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern WindowAggState *ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecWindowAgg(WindowAggState *node);
+extern void ExecWindowAgg(WindowAggState *node);
 extern void ExecEndWindowAgg(WindowAggState *node);
 extern void ExecReScanWindowAgg(WindowAggState *node);
 
diff --git a/src/include/executor/nodeWorktablescan.h b/src/include/executor/nodeWorktablescan.h
index 217208a..7b1eecb 100644
--- a/src/include/executor/nodeWorktablescan.h
+++ b/src/include/executor/nodeWorktablescan.h
@@ -17,7 +17,7 @@
 #include "nodes/execnodes.h"
 
 extern WorkTableScanState *ExecInitWorkTableScan(WorkTableScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecWorkTableScan(WorkTableScanState *node);
+extern void ExecWorkTableScan(WorkTableScanState *node);
 extern void ExecEndWorkTableScan(WorkTableScanState *node);
 extern void ExecReScanWorkTableScan(WorkTableScanState *node);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4b18436..ff6c453 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1031,6 +1031,8 @@ typedef struct PlanState
 								 * top-level plan */
 
 	struct PlanState *parent;	/* node which will receive tuples from us */
+	bool		result_ready;	/* true if result is ready */
+	Node	   *result;			/* result, most often TupleTableSlot */
 
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
-- 
2.9.2

#54Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#53)
Re: asynchronous and vectorized execution

No, it was wrong.

At Mon, 29 Aug 2016 17:08:36 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160829.170836.161449399.horiguchi.kyotaro@lab.ntt.co.jp>

Hello,

I considered applying the async infrastructure to nodeGather,
but since parallel workers hardly make Gather (or the leader)
wait, it's really useless at least for simple cases. Furthermore,
as several people may have said before, differently from foreign
scans, gather (or other kinds of parallel) nodes usually have
only several workers, and will have at most two-digit numbers
even on so-called many-core boxes. I finally gave up applying
this to nodeGather.

I overlooked that a local scan takes place instead of waiting
for workers to become ready. I will reconsider, taking that into
account.

As a result, the attached patchset is functionally the same as
the last version, but replaces a misused Assert with AssertMacro.
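
(Assert() expands to a statement, while AssertMacro() is an
expression usable where a statement is not allowed. A contrived
sketch of the distinction; the function below is only a made-up
example, not code from the patchset:)

#include "postgres.h"

static int
first_element(const int *arr, int len)
{
    /* AssertMacro() can be embedded inside an expression... */
    return (AssertMacro(len > 0), arr[0]);
    /* ...whereas (Assert(len > 0), arr[0]) would not compile. */
}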

--
Kyotaro Horiguchi
NTT Open Source Software Center


#55Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#54)
Re: asynchronous and vectorized execution

These are random thoughts on this patch.

At Tue, 30 Aug 2016 12:17:52 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160830.121752.100817694.horiguchi.kyotaro@lab.ntt.co.jp>

As a result, the attached patchset is functionally the same as
the last version, but replaces a misused Assert with AssertMacro.

There's a performance degradation for non-asynchronous nodes, as
shown as 't0' below.

The patch adds two "if-then" branches and one additional function
call as asynchronous support in ExecProcNode, which is on a very
hot code path and formerly consisted of only five meaningful
lines. The added code slows performance by about 1% for the
simple seqscan case. The following are the performance numbers
previously shown upthread. (Or the difference might be too small
to be meaningful..)

===
t0- (SeqScan()) (2 parallel)
pl- (Append(4 * SeqScan()))
pf0 (Append(4 * ForeignScan())) all ForeignScans are on the same connection.
pf1 (Append(4 * ForeignScan())) all ForeignScans have their own connections.

patched-O2    time(ms)  stddev(ms)  gain from unpatched (%)
t0             4121.27        1.10                    -1.44
pl             1757.41        0.94                    -1.73
pf0            6458.99      192.40                    20.26
pf1            1747.40       24.81                    78.39

unpatched-O2  time(ms)  stddev(ms)
t0             4062.60        1.95
pl             1727.45        9.41
pf0            8100.47       24.51
pf1            8086.52       33.53
===

So, finally, it seems that the infrastructure should not live in
ExecProcNode, or else we need to redesign the executor. I tried a
jump table to dispatch nodes, but in vain. Having a flag in
EState may be able to keep the async stuff off the non-async
route (similar to, but maybe different from, my first patch). JIT
compiling seems promising but it is a different thing.
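
To illustrate, a rough sketch of the flag-in-EState idea might
look like this. es_async_capable and ExecAsyncProcNode are
tentative names appearing in no posted patch; ExecDispatchNode
and ExecConsumeResult are from the attached patchset:

TupleTableSlot *
ExecProcNode(PlanState *node)
{
    CHECK_FOR_INTERRUPTS();

    if (!node->state->es_async_capable)
    {
        /* Plans without async nodes keep the old, short code path. */
        ExecDispatchNode(node);
        return ExecConsumeResult(node);
    }

    /* Only async-capable plans pay for result queueing and waiting. */
    return ExecAsyncProcNode(node);
}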

As for nodeGather, it expects the leader process to act as one of
the workers; the leader should be freed from that so as to behave
as an async node. But still, the expected number of workers seems
too small to take a meaningful benefit from async execution.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#56Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Kyotaro HORIGUCHI (#55)
1 attachment(s)
Re: asynchronous and vectorized execution

Hello,

At Thu, 01 Sep 2016 16:12:31 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in <20160901.161231.110068639.horiguchi.kyotaro@lab.ntt.co.jp>

There's a performance degradation for non-asynchronous nodes, as
shown as 't0' below.

The patch adds two "if-then" branches and one additional function
call as asynchronous support in ExecProcNode, which is on a very
hot code path and formerly consisted of only five meaningful
lines. The added code slows performance by about 1% for the
simple seqscan case. The following are the performance numbers
previously shown upthread. (Or the difference might be too small
to be meaningful..)

I tried __builtin_expect before moving the stuff out of
execProcNode (patch attached). I found a conversation about this
builtin in a past discussion.

/messages/by-id/CA+TgmoYknejCgWMb8Tg63qA67JoUG2uCc0DZc5mm9=e18AmigA@mail.gmail.com

If we can show cases where it reliably produces a significant
speedup, then I would think it would be worthwhile

I got the following results.

master(67e1e2a)-O2
time(ms) stddev(ms)
t0: 3928.22 ( 0.40) # Simple SeqScan only
pl: 1665.14 ( 0.53) # Append(SeqScan)

Patched-O2 / NOT Use __builtin_expect
t0: 4042.69 ( 0.92) degradation to master is 2.9%
pl: 1698.46 ( 0.44) degradation to master is 2.0%

Patched-O2 / Use __builtin_expect
t0: 3886.69 ( 1.93) *gain* to master is 1.06%
pl: 1671.66 ( 0.67) degradation to master is 0.39%

I haven't directly confirmed the builtin's effect on optimization
of the surrounding code, but I suspect there is some. I also
tried the builtin in ExecAppend, but saw no difference. The
numbers fluctuate easily with any change in the machine's state,
so the lower digits aren't trustworthy, but several successive
repetitions showed fluctuations of up to some milliseconds.

execProcNode can be left as it is if __builtin_expect is usable,
but ExecAppend still needs an improvement.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Use-__builtin_expect-to-optimize-branches.patchtext/x-patch; charset=us-asciiDownload
From f1f02557f7f4d694f0e3b4d62a6bdfad8e746b03 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Mon, 12 Sep 2016 16:36:37 +0900
Subject: [PATCH] Use __builtin_expect to optimize branches

__builtin_expect can minimize the cost of failure of branch prediction
for certain cases. It seems to work very fine here.
---
 src/backend/executor/execProcnode.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index cef262b..c247fa3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -585,13 +585,22 @@ ExecExecuteNode(PlanState *node)
  *      execution subtree and every subtree should have an individual context.
  *      ----------------------------------------------------------------
  */
+#define MY_USE_LIKELY
+#if defined MY_USE_LIKELY
+#define my_likely(x) __builtin_expect((x),1)
+#define my_unlikely(x) __builtin_expect((x),0)
+#else
+#define my_likely(x) (x)
+#define my_unlikely(x) (x)
+#endif
+
 TupleTableSlot *
 ExecProcNode(PlanState *node)
 {
 	CHECK_FOR_INTERRUPTS();
 
 	/* Return unconsumed result if any */
-	if (node->result_ready)
+	if (my_unlikely(node->result_ready))
 		return ExecConsumeResult(node);
 
 	if (node->chgParam != NULL) /* something changed */
@@ -599,7 +608,7 @@ ExecProcNode(PlanState *node)
 
 	ExecDispatchNode(node);
 
-	if (!node->result_ready)
+	if (my_unlikely(!node->result_ready))
 		ExecAsyncWaitForNode(node);
 
 	return ExecConsumeResult(node);
-- 
2.9.2

#57Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro HORIGUCHI (#53)
Re: asynchronous and vectorized execution

On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

[ new patches ]

+            /*
+             * We assume that few nodes are async-aware and async-unaware
+             * nodes cannot be reverse-dispatched from lower nodes that are
+             * async-aware. Firing of an async node that is not a descendant
+             * of the planstate will cause such reverse-dispatching to
+             * async-aware nodes, which is unexpected behavior for them.
+             *
+             * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
+             * where the OUTER is running asynchronously but the Hashjoin is
+             * waiting on the async INNER during inner-hash creation. If the
+             * OUTER fires for the case, since anyone is waiting on it,
+             * ExecAsyncWaitForNode finally dispatches to the Hashjoin which
+             * is now in the middle of doing its work.
+             */
+            if (!IsParent(planstate, node))
+                continue;

I'm not entirely sure that I understand this comment, but I don't
think it's going in the right direction. Let's start with the example
in the second paragraph. If the hash join is async-unaware, then it
isn't possible for the hash join to be both running the outer side of
the join asynchronously and at the same time waiting on the inner
side. Once it tries to pull the first tuple from the outer side, it's
waiting for that to finish and can't do anything else. So, the inner
side can't possibly get touched in any way until the outer side
finishes. For anything else to happen, the hash join would have to be
async-aware. Even if we did that, I don't think it would be right to
kick off both sides of the hash join at the same time. Right now, if
the outer side turns out to be empty, we never need to build the hash
table, and that's good.

I don't think it's a good idea to wait for only nodes that are in the
current subtree. For example, consider a plan like this:

Append
-> Foreign Scan on a
-> Hash Join
   -> Foreign Scan on b
   -> Hash
      -> Seq Scan on x

Suppose Append and Foreign Scan are async-aware but the other nodes
are not. Append kicks off the Foreign Scan on a and then waits for
the hash join to produce a tuple; the hash join kicks off the Foreign
Scan on b and waits for it to return a tuple. If, while we're waiting
for the foreign scan on b, the foreign scan on a needs some attention
- either to produce tuples, or maybe just to call PQconsumeInput() so
that more data can be sent from the other side, I think we need to be
able to do that. There's no real problem here; even if the Append
becomes result-ready before the hash join returns, that is fine. We
will not actually be able to return from the append until the hash
join returns because of what's on the call stack, but that doesn't
mean that the Append can't be marked result-ready sooner than that.
The situation can be improved by making the hash join node
async-aware, but if we don't do that it's still not broken.

I think the reason that you originally got backed into this design was
because of problems with reentrancy. I don't think I really
understand in any detail exactly what problem you hit there, but it
seems to me that the following problem could occur:
ExecAsyncWaitForNode finds two events and schedules two callbacks. It
calls the first of those two callbacks. Before that callback returns,
it again calls ExecAsyncWaitForNode. But the new invocation of
ExecAsyncWaitForNode doesn't know that there is a second callback
pending, so it somehow gets confused. However, I think this problem
can fixed using a different method. The occurred_event and callbacks
arrays defined by ExecAsyncWaitForNode can be made part of the EState
rather than being local variables. When ExecAsyncWaitForNode is
called, it checks whether there are any pending callbacks; if so, it
removes and calls the first one. Only if there are no pending
callbacks does it actually wait; when a wait event fires, one or more
new callbacks are generated. This is very similar to the reason why
ReceiveSharedInvalidMessages uses a static messages array rather than
a normal local variable. That function is solving a problem which I
suspect is very similar to the one we have here. However, it would be
helpful if you could provide some more details on what you think the
reentrancy problems are, because I'm not able to figure them out from
your messages so far.
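
To sketch what I have in mind (PendingAsyncCallback, the
es_async_* fields, and ExecAsyncEventWait are invented names,
purely for illustration):

typedef struct PendingAsyncCallback
{
    void        (*fn) (PlanState *planstate);   /* callback to run */
    PlanState  *planstate;      /* node whose event fired */
} PendingAsyncCallback;

void
ExecAsyncWaitForNode(EState *estate, PlanState *node)
{
    while (!node->result_ready)
    {
        /* Drain callbacks queued by any outer invocation first. */
        if (estate->es_async_head < estate->es_async_tail)
        {
            PendingAsyncCallback *cb =
                &estate->es_async_callbacks[estate->es_async_head++];

            cb->fn(cb->planstate);
            continue;
        }

        /* Queue empty: reset it and wait; events enqueue new callbacks. */
        estate->es_async_head = estate->es_async_tail = 0;
        ExecAsyncEventWait(estate);
    }
}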

The most mysterious part of this hunk to me is the comment that
"Firing of an async node that is not a descendant of the planstate
will cause such reverse-dispatching to async-aware nodes, which is
unexpected behavior for them." It is the async-unaware nodes which
might have a problem. The nodes that have been taught about the new
system should know what to expect.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#58Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro HORIGUCHI (#52)
Re: asynchronous and vectorized execution

On Tue, Aug 2, 2016 at 3:41 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Thank you for the comment.

At Mon, 1 Aug 2016 10:44:56 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in <CAJ3gD9ek4Y4SGTSuc_pzkGYwLMbrc9QOM7m1D8bj99JNW16o0g@mail.gmail.com>

On 21 July 2016 at 15:20, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

After some consideration, I found that ExecAsyncWaitForNode
cannot be reentrant, because that would mean control goes into
async-unaware nodes while there are not-ready nodes, which is an
inconsistent state. To inhibit such reentering, I allocated node
identifiers in depth-first order so that the ascendant-descendant
relationship can be checked in a simple way (nested-set model),
and call ExecAsyncConfigureWait only for the descendant nodes of
the parameter planstate.

We have estate->waiting_nodes containing a mix of async-aware and
non-async-aware nodes. I was thinking, an asynchrony tree would have only
async-aware nodes, with possibly multiple asynchrony sub-trees in a tree.
Somehow, if we restrict the bubbling up of events only up to the root of the
asynchrony subtree, do you think we can simplify some of the complexities?

The current code prohibits registration of nodes outside the
current subtree to avoid the reentering disaster.

Indeed, leaving the "waiting node" mark or something like it on
every root node at the first visit will enable the propagation to
stop at the root of any async-subtree. Nevertheless, when an
async-child in an inactive async-root fires, the new tuple is
loaded but not consumed, and then the succeeding firing on the
same child leads to a deadlock (without result queueing). However,
that can be avoided if ExecAsyncConfigureWait doesn't register
nodes in ready state.

Why would a node call ExecAsyncConfigureWait in the first place if it
already had a result ready? I think it shouldn't do that.
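
In code form, that rule could be as simple as a guard at
registration time. This is just an illustrative sketch; the
waiting_nodes list, the helper itself, and even
ExecAsyncConfigureWait's exact signature are guesses rather than
the patch's actual code:

static void
configure_waits(EState *estate, WaitEventSet *wes)
{
    ListCell   *lc;

    foreach(lc, estate->waiting_nodes)
    {
        PlanState  *ps = (PlanState *) lfirst(lc);

        /* A node holding an unconsumed result must not register a wait. */
        if (ps->result_ready)
            continue;

        ExecAsyncConfigureWait(ps, wes);
    }
}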

On the other hand, any two or more asynchronous nodes can share a
synchronization object. For instance, multiple postgres_fdw scan
nodes can share one server connection, and only one of them can
get into a waitable state at once. If no async-child in the
current async subtree is waitable, it must be stuck. So I think it
is crucial for ExecAsyncWaitForNode to force at least one child
*in the current async subtree* to get into waiting state in such a
situation. The ascendant-descendant relationship is necessary to
do that anyway.

This is another example of a situation where waiting only for nodes
within a subtree causes problems.

Suppose there are two Foreign Scans in completely different parts of
the plan tree that are going to use, in alternation, the same
connection to the same remote server. When we encounter the first
one, it kicks off the query, uses ExecAsyncConfigureWait to register
itself as waiting, and returns without becoming ready. When we
encounter the second one, it can't kick off the query and therefore
has no chance of becoming ready until after the first one has finished
with the connection. Suppose we then wait for the second Foreign
Scan. Well, we had better wait for the first one, too! If we don't,
it will never finish with the connection, so the second node will
never get to use it, and now we're in trouble.

I think what we need is for the ConnCacheEntry to have a place to note
the ForeignScanState that is using the connection and any other
PlanState objects that would like to use it. When one
ForeignScanState is done with the ConnCacheEntry, it activates the
next one, which then takes over. That seems simple enough, but
there's a problem here for suspended queries: if we stop executing a
plan while some scan within that plan is holding onto a
ConnCacheEntry, and then we run some other query that wants to use the
same one, we've got a problem. Maybe we can get by with letting the
other query finish running and then executing our own query, but that
might be messy to implement. Another idea is to somehow let any
in-progress query finish running before allowing the first query to be
suspended; that would need some new infrastructure.
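
Roughly, I'm imagining something like the following; the field
names, the helper, and start_remote_query are placeholders, not
postgres_fdw's actual definitions:

typedef struct ConnCacheEntry
{
    PGconn     *conn;                   /* the shared libpq connection */
    ForeignScanState *current_owner;    /* scan with a query in flight */
    List       *waiters;                /* ForeignScanStates queued behind it */
} ConnCacheEntry;

/* Called when the current owner has consumed its last result. */
static void
connection_handover(ConnCacheEntry *entry)
{
    entry->current_owner = NULL;

    if (entry->waiters != NIL)
    {
        ForeignScanState *next = (ForeignScanState *) linitial(entry->waiters);

        entry->waiters = list_delete_first(entry->waiters);
        entry->current_owner = next;

        /* Kick off the next scan's remote query on the freed connection. */
        start_remote_query(next);   /* placeholder for the actual kickoff */
    }
}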

My main point here is that I think waiting for only a subtree is an
idea that cannot work out well. Whatever problems are pushing you
into that design, we need to confront those problems directly and fix
them. There shouldn't be any unsolvable problems in waiting for
everything in the whole query, and I'm pretty sure that's going to be
a more elegant and better-performing design.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#59Amit Khandekar
amitdkhan.pg@gmail.com
In reply to: Robert Haas (#57)
Re: asynchronous and vectorized execution

On 13 September 2016 at 20:20, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

[ new patches ]

+            /*
+             * We assume that few nodes are async-aware and async-unaware
+             * nodes cannot be reverse-dispatched from lower nodes that are
+             * async-aware. Firing of an async node that is not a descendant
+             * of the planstate will cause such reverse-dispatching to
+             * async-aware nodes, which is unexpected behavior for them.
+             *
+             * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
+             * where the OUTER is running asynchronously but the Hashjoin is
+             * waiting on the async INNER during inner-hash creation. If the
+             * OUTER fires for the case, since anyone is waiting on it,
+             * ExecAsyncWaitForNode finally dispatches to the Hashjoin which
+             * is now in the middle of doing its work.
+             */
+            if (!IsParent(planstate, node))
+                continue;

I'm not entirely sure that I understand this comment, but I don't
think it's going in the right direction. Let's start with the example
in the second paragraph. If the hash join is async-unaware, then it
isn't possible for the hash join to be both running the outer side of
the join asynchronously and at the same time waiting on the inner
side. Once it tries to pull the first tuple from the outer side, it's
waiting for that to finish and can't do anything else. So, the inner
side can't possibly get touched in any way until the outer side
finishes. For anything else to happen, the hash join would have to be
async-aware. Even if we did that, I don't think it would be right to
kick off both sides of the hash join at the same time. Right now, if
the outer side turns out to be empty, we never need to build the hash
table, and that's good.

I feel the !IsParent() condition is actually to prevent the
infinite wait caused by a re-entrancy issue in
ExecAsyncWaitForNode() that Kyotaro mentioned earlier. But yes,
the comments don't explain exactly how the hash join can cause
the re-entrancy issue.

But I attempted to come up with some testcase which might
reproduce the infinite waiting in ExecAsyncWaitForNode() after
removing the !IsParent() check so that the other subtree nodes
are also included, but I couldn't reproduce it. Kyotaro, is it
possible for you to give a testcase that consistently hangs if we
revert the !IsParent() check?

I was also thinking about another possibility where the same plan
state node is re-entered, as explained below.

I don't think it's a good idea to wait for only nodes that are in the
current subtree. For example, consider a plan like this:

Append
-> Foreign Scan on a
-> Hash Join
   -> Foreign Scan on b
   -> Hash
      -> Seq Scan on x

Suppose Append and Foreign Scan are async-aware but the other nodes
are not. Append kicks off the Foreign Scan on a and then waits for
the hash join to produce a tuple; the hash join kicks off the Foreign
Scan on b and waits for it to return a tuple. If, while we're waiting
for the foreign scan on b, the foreign scan on a needs some attention
- either to produce tuples, or maybe just to call PQconsumeInput() so
that more data can be sent from the other side, I think we need to be
able to do that. There's no real problem here; even if the Append
becomes result-ready before the hash join returns, that is fine.

Yes, I agree: we should be able to do this. Since we have all the
waiting events in a common estate, there's no harm if we start
executing nodes of another sub-tree if we get an event from
there.

But I am thinking about what would happen when this node from the
other sub-tree returns result_ready, and then its parents are
called, and then the result gets bubbled up to the node which had
already caused us to call ExecAsyncWaitForNode() in the first
place.

For example, in the above plan which you specified, suppose:
1. Hash Join has called ExecProcNode() for the child foreign scan
b, and so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
2. The event wait list already has foreign scan on a, which is on
a different subtree.
3. This foreign scan a happens to be ready, so in
ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is
called, which returns with result_ready.
4. Since it returns result_ready, its parent node is now inserted
into the callbacks array, and so its parent (Append) is executed.
5. But this Append planstate is already in the middle of
executing Hash Join, and is waiting for HashJoin.

Is it safe to execute the same plan state when it is already in
the middle of its execution? In other words, is the plan state
re-entrant? I suspect the new execution may even corrupt the
structures with which it was already executing.

In usual cases, a tree can contain multiple plan state nodes
belonging to the same plan node, but in this case, the same plan
state node is executed again while it is already executing.

I suspect this can be one reason why Kyotaro might be getting
infinite recursion issues. Maybe we need to prevent a plan state
node from re-entering, but allow nodes from any subtree to
execute. So propagate the result upwards until we get a node
which is already executing.

#60Robert Haas
robertmhaas@gmail.com
In reply to: Amit Khandekar (#59)
Re: asynchronous and vectorized execution

On Fri, Sep 23, 2016 at 8:45 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

For example, in the above plan which you specified, suppose:
1. Hash Join has called ExecProcNode() for the child foreign scan
b, and so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
2. The event wait list already has foreign scan on a, which is on
a different subtree.
3. This foreign scan a happens to be ready, so in
ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is
called, which returns with result_ready.
4. Since it returns result_ready, its parent node is now inserted
into the callbacks array, and so its parent (Append) is executed.
5. But this Append planstate is already in the middle of
executing Hash Join, and is waiting for HashJoin.

Ah, yeah, something like that could happen. I've spent much of this
week working on a new design for this feature which I think will avoid
this problem. It doesn't work yet - in fact I can't even really test
it yet. But I'll post what I've got by the end of the day today so
that anyone who is interested can look at it and critique.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#61Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Amit Khandekar (#59)
Re: asynchronous and vectorized execution

Hello, thank you for the comment.

At Fri, 23 Sep 2016 18:15:40 +0530, Amit Khandekar <amitdkhan.pg@gmail.com> wrote in <CAJ3gD9fZ=rtBZ0i1_pxycbkgxi=OzTgv1n0ojkmK318Mcc921A@mail.gmail.com>

On 13 September 2016 at 20:20, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Aug 29, 2016 at 4:08 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

[ new patches ]

+            /*
+             * We assume that few nodes are async-aware and async-unaware
+             * nodes cannot be reverse-dispatched from lower nodes that are
+             * async-aware. Firing of an async node that is not a descendant
+             * of the planstate will cause such reverse-dispatching to
+             * async-aware nodes, which is unexpected behavior for them.
+             *
+             * For instance, consider an async-unaware Hashjoin(OUTER, INNER)
+             * where the OUTER is running asynchronously but the Hashjoin is
+             * waiting on the async INNER during inner-hash creation. If the
+             * OUTER fires for the case, since anyone is waiting on it,
+             * ExecAsyncWaitForNode finally dispatches to the Hashjoin which
+             * is now in the middle of doing its work.
+             */
+            if (!IsParent(planstate, node))
+                continue;

I'm not entirely sure that I understand this comment, but I don't

Sorry for the read-resistant comment...

think it's going in the right direction. Let's start with the example
in the second paragraph. If the hash join is async-unaware, then it
isn't possible for the hash join to be both running the outer side of
the join asynchronously and at the same time waiting on the inner
side. Once it tries to pull the first tuple from the outer side, it's
waiting for that to finish and can't do anything else. So, the inner
side can't possibly get touched in any way until the outer side
finishes. For anything else to happen, the hash join would have to be
async-aware. Even if we did that, I don't think it would be right to
kick off both sides of the hash join at the same time. Right now, if
the outer side turns out to be empty, we never need to build the hash
table, and that's good.

I feel the !IsParent() condition is actually to prevent the
infinite wait caused by a re-entrancy issue in
ExecAsyncWaitForNode() that Kyotaro mentioned earlier. But yes,
the comments don't explain exactly how the hash join can cause
the re-entrancy issue.

But I attempted to come up with some testcase which might
reproduce the infinite waiting in ExecAsyncWaitForNode() after
removing the !IsParent() check so that the other subtree nodes
are also included, but I couldn't reproduce it. Kyotaro, is it
possible for you to give a testcase that consistently hangs if we
revert the !IsParent() check?

I dragged out of my memory that it happened during the
regression test of postgres_fdw, and it is still reproducible in
the same manner.

postgres_fdw> make check
...
============== running regression test queries ==============
test postgres_fdw ... FAILED (test process exited with exit code 2)
...

And in server log,

== contrib/postgres_fdw/log/postmaster.log
TRAP: FailedAssertion("!(hashtable == ((void *)0))", File: "nodeHashjoin.c", Line: 123)
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection with an open transaction
LOG: server process (PID 9130) was terminated by signal 6: Aborted
DETAIL: Failed process was running: SELECT * FROM ft1 t1 WHERE t1.c3 IN (SELECT c3 FROM ft2 t2 WHERE c1 <= 10) ORDER BY c1;

nodeHashjoin.c:116:

    switch (node->hj_JoinState)
    {
        case HJ_BUILD_HASHTABLE:

            /*
             * First time through: build hash table for inner relation.
             */
            Assert(hashtable == NULL);

This is the reentrance of ExecHashJoin.

Instead, by doing installcheck and then connecting to the
database "contrib_regression" after the failure, we can see what
plan was tried.

contrib_regression=# explain SELECT * FROM ft1 t1 WHERE t1.c3 IN (SELECT c3 FROM ft2 t2 WHERE c1 <= 10) ORDER BY c1;
                                      QUERY PLAN
---------------------------------------------------------------------------------------
 Sort  (cost=275.96..277.21 rows=500 width=47)
   Sort Key: t1.c1
   ->  Hash Join  (cost=208.78..253.54 rows=500 width=47)
         Hash Cond: (t1.c3 = t2.c3)
         ->  Foreign Scan on ft1 t1  (cost=100.00..141.00 rows=1000 width=47)
         ->  Hash  (cost=108.77..108.77 rows=1 width=6)
               ->  HashAggregate  (cost=108.76..108.77 rows=1 width=6)
                     Group Key: t2.c3
                     ->  Foreign Scan on ft2 t2  (cost=100.28..108.73 rows=12 width=6)
(9 rows)

I was also thinking about another possibility where the same plan
state node is re-entered, as explained below.

I don't think it's a good idea to wait for only nodes that are in the
current subtree. For example, consider a plan like this:

Append
-> Foreign Scan on a
-> Hash Join
   -> Foreign Scan on b
   -> Hash
      -> Seq Scan on x

Suppose Append and Foreign Scan are async-aware but the other nodes
are not. Append kicks off the Foreign Scan on a and then waits for
the hash join to produce a tuple; the hash join kicks off the Foreign
Scan on b and waits for it to return a tuple. If, while we're waiting
for the foreign scan on b, the foreign scan on a needs some attention
- either to produce tuples, or maybe just to call PQconsumeInput() so
that more data can be sent from the other side, I think we need to be
able to do that. There's no real problem here; even if the Append
becomes result-ready before the hash join returns, that is fine.

Yes, I agree: we should be able to do this. Since we have all the
waiting events in a common estate, there's no harm if we start
executing nodes of another sub-tree if we get an event from
there.

But I am thinking about what would happen when this node from the
other sub-tree returns result_ready, and then its parents are
called, and then the result gets bubbled up to the node which had
already caused us to call ExecAsyncWaitForNode() in the first
place.

For example, in the above plan which you specified, suppose:
1. Hash Join has called ExecProcNode() for the child foreign scan
b, and so is waiting in ExecAsyncWaitForNode(foreign_scan_on_b).
2. The event wait list already has foreign scan on a, which is on
a different subtree.
3. This foreign scan a happens to be ready, so in
ExecAsyncWaitForNode(), ExecDispatchNode(foreign_scan_a) is
called, which returns with result_ready.
4. Since it returns result_ready, its parent node is now inserted
into the callbacks array, and so its parent (Append) is executed.
5. But this Append planstate is already in the middle of
executing Hash Join, and is waiting for HashJoin.

This should be what I wanted to explain by the encrypted comment :(

Is it safe to execute the same plan state when it is already in
the middle of its execution? In other words, is the plan state
re-entrant? I suspect the new execution may even corrupt the
structures with which it was already executing.

It should be safe in most cases, but HashJoin and some other
nodes have internal state beyond their descendant nodes. Such
nodes cannot be reentered.

In usual cases, a tree can contain multiple plan state nodes
belonging to the same plan node, but in this case, the same plan
state node is executed again while it is already executing.

I suspect this can be one reason why Kyotaro might be getting
infinite recursion issues. Maybe we need to prevent a plan state
node from re-entering, but allow nodes from any subtree to
execute. So propagate the result upwards until we get a node
which is already executing.

Sorry for not responding, but the answer is yes. We should be
able to avoid the problem by managing execution state for every
node. But it needs an additional flag in the *State structs and
manipulation of it while shuttling up and down the execution
tree.
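
For example, with a hypothetical "executing" flag in PlanState,
the upward propagation could stop at a node that is already on
the call stack. A sketch, not taken from any posted patch (only
parent, result_ready, and ExecDispatchNode come from the
patchset; the flag and this function are invented):

/* Bubble a queued result upward, but never reenter an active node. */
static void
ExecAsyncPropagateResult(PlanState *node)
{
    PlanState  *parent = node->parent;

    /*
     * If the parent is already in the middle of its own work (e.g. a
     * HashJoin building its hash table), leave the result queued; the
     * parent will pick it up when its own processing resumes.
     */
    if (parent == NULL || parent->executing)
        return;

    parent->executing = true;
    ExecDispatchNode(parent);
    parent->executing = false;

    if (parent->result_ready)
        ExecAsyncPropagateResult(parent);
}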

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


#62Michael Paquier
michael.paquier@gmail.com
In reply to: Kyotaro HORIGUCHI (#61)
Re: asynchronous and vectorized execution

On Thu, Sep 29, 2016 at 5:50 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Sorry for not responding, but the answer is yes. We should be
able to avoid the problem by managing execution state for every
node. But it needs an additional flag in the *State structs and
manipulation of it while shuttling up and down the execution
tree.

Moved to next CF.
--
Michael


#63Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Michael Paquier (#62)
Re: asynchronous and vectorized execution

At Mon, 3 Oct 2016 13:14:23 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSf8dBndoKT5DeR6FpzDUSuXN_g7uWNPQuN_A_sEwB-uw@mail.gmail.com>

On Thu, Sep 29, 2016 at 5:50 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Sorry for not responding, but the answer is yes. We should be
able to avoid the problem by managing execution state for every
node. But it needs an additional flag in the *State structs and
manipulation of it while shuttling up and down the execution
tree.

Moved to next CF.

Thank you.

--
Kyotaro Horiguchi
NTT Open Source Software Center


#64Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Kyotaro HORIGUCHI (#63)
Re: asynchronous and vectorized execution

On Mon, Oct 3, 2016 at 3:25 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Mon, 3 Oct 2016 13:14:23 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in <CAB7nPqSf8dBndoKT5DeR6FpzDUSuXN_g7uWNPQuN_A_sEwB-uw@mail.gmail.com>

On Thu, Sep 29, 2016 at 5:50 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

Sorry for not responding, but the answer is yes. We should be
able to avoid the problem by managing execution state for every
node. But it needs an additional flag in the *State structs and
manipulation of it while shuttling up and down the execution
tree.

Moved to next CF.

Thank you.

Closed in 2016-11 commitfest with "returned with feedback" status.
This is as per my understanding of the recent mails on the thread.
Please feel free to update the status if the current status doesn't
reflect the exact status of the patch.

Regards,
Hari Babu
Fujitsu Australia