PassDownLimitBound for ForeignScan/CustomScan [take-2]

Started by Kouhei Kaigaiabout 9 years ago7 messages
#1Kouhei Kaigai
kaigai@ak.jp.nec.com
1 attachment(s)

Hello,

The attached patch is revised version of the pass-down-bounds feature.
Its functionality is not changed from the previous version, however,
implementation was revised according to the discussion at the last CF.

This patch add a new fields (ps_numTuples) to the PlanState. This is
a hint for optimization when parent node needs first N-tuples only.
It shall be set prior to ExecProcNode() after ExecInitNode() or
ExecReScan() by the parent node, then child nodes can adjust its
execution behavior (e.g, Sort will take top-N heapsort if ps_numTuples
is set) and pass down the hint to its child nodes furthermore.

As an example, I enhanced postgres_fdw to understand the ps_numTuples
if it is set. If and when remote ORDER BY is pushed down, the latest
code tries to sort the entire remote table because it does not know
how many rows to be returned. Thus, it took larger execution time.
On the other hands, the patched one runs the remote query with LIMIT
clause according to the ps_numTuples; which is informed by the Limit
node on top of the ForeignScan node.

* without patch
=================
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Limit (cost=100.00..100.43 rows=10 width=52) (actual time=2332.548..2332.550 rows=10 loops=1)
Output: id, x, y, z
-> Foreign Scan on public.ft (cost=100.00..146.46 rows=1077 width=52) (actual time=2332.547..2332.548 rows=10 loops=1)
Output: id, x, y, z
Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS LAST, y ASC NULLS LAST
Planning time: 0.177 ms
Execution time: 2445.590 ms
(7 rows)

* with patch
==============
postgres=# explain (analyze,verbose) select * from ft order by x,y limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Limit (cost=100.00..100.43 rows=10 width=52) (actual time=579.469..579.471 rows=10 loops=1)
Output: id, x, y, z
-> Foreign Scan on public.ft (cost=100.00..146.46 rows=1077 width=52) (actual time=579.468..579.469 rows=10 loops=1)
Output: id, x, y, z
Remote SQL: SELECT id, x, y, z FROM public.t ORDER BY x ASC NULLS LAST, y ASC NULLS LAST
Planning time: 0.123 ms
Execution time: 579.858 ms
(7 rows)

Right now, I have a few concerns for this patch.
1. Because LIMIT clause can have expression not only constant value,
we cannot determine the value of ps_numTuples until execution time.
So, it is not possible to adjust remote query on planning time, and
EXPLAIN does not show exact remote query even if LIMIT clause was
attached actually.

2. Where is the best location to put the interface contract to set
ps_numTuples field. It has to be set prior to the first ExecProcNode()
after ExecInitNode() or ExecReScan().

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Show quoted text

-----Original Message-----
From: Robert Haas [mailto:robertmhaas@gmail.com]
Sent: Friday, September 16, 2016 12:39 AM
To: Kaigai Kouhei(海外 浩平)
Cc: Jeevan Chalke; pgsql-hackers@postgresql.org; Etsuro Fujita; Andres Freund
Subject: ##freemail## Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan

On Tue, Sep 13, 2016 at 9:07 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

In the current implementation calls recompute_limits() on the first
invocation of ExecLimit and ExecReScanLimit. Do we expect the
ps->numTuples will be also passed down to the child nodes on the same
timing?

Sure, unless we find some reason why that's not good.

I also think this new executor contract shall be considered as a hint
(but not a requirement) for the child nodes, because it allows the
parent nodes to re-distribute the upper limit regardless of the type
of the child nodes as long as the parent node can work correctly and
has benefit even if the child node returns a part of tuples. It makes
the decision whether the upper limit should be passed down much simple.
The child node "can" ignore the hint but can utilize for more optimization.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

passdown-limit-fdw.v1.patchapplication/octet-stream; name=passdown-limit-fdw.v1.patchDownload
 contrib/postgres_fdw/postgres_fdw.c        |  4 ++
 src/backend/executor/nodeAgg.c             |  1 +
 src/backend/executor/nodeAppend.c          |  1 +
 src/backend/executor/nodeBitmapAnd.c       |  1 +
 src/backend/executor/nodeBitmapHeapscan.c  |  1 +
 src/backend/executor/nodeBitmapIndexscan.c |  1 +
 src/backend/executor/nodeBitmapOr.c        |  1 +
 src/backend/executor/nodeCtescan.c         |  1 +
 src/backend/executor/nodeCustom.c          |  1 +
 src/backend/executor/nodeForeignscan.c     |  1 +
 src/backend/executor/nodeFunctionscan.c    |  1 +
 src/backend/executor/nodeGather.c          |  1 +
 src/backend/executor/nodeGroup.c           |  1 +
 src/backend/executor/nodeHash.c            |  1 +
 src/backend/executor/nodeHashjoin.c        |  1 +
 src/backend/executor/nodeIndexonlyscan.c   |  1 +
 src/backend/executor/nodeIndexscan.c       |  1 +
 src/backend/executor/nodeLimit.c           | 69 ++----------------------------
 src/backend/executor/nodeLockRows.c        |  1 +
 src/backend/executor/nodeMaterial.c        |  1 +
 src/backend/executor/nodeMergeAppend.c     |  7 ++-
 src/backend/executor/nodeMergejoin.c       |  1 +
 src/backend/executor/nodeModifyTable.c     |  1 +
 src/backend/executor/nodeNestloop.c        |  1 +
 src/backend/executor/nodeRecursiveunion.c  |  1 +
 src/backend/executor/nodeResult.c          | 26 +++++++++++
 src/backend/executor/nodeSamplescan.c      |  1 +
 src/backend/executor/nodeSeqscan.c         |  1 +
 src/backend/executor/nodeSetOp.c           |  1 +
 src/backend/executor/nodeSort.c            | 11 +++++
 src/backend/executor/nodeSubqueryscan.c    |  1 +
 src/backend/executor/nodeTidscan.c         |  1 +
 src/backend/executor/nodeUnique.c          |  1 +
 src/backend/executor/nodeValuesscan.c      |  1 +
 src/backend/executor/nodeWindowAgg.c       |  1 +
 src/backend/executor/nodeWorktablescan.c   |  1 +
 src/include/nodes/execnodes.h              |  3 ++
 37 files changed, 85 insertions(+), 66 deletions(-)

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 906d6e6..eaaa083 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2942,6 +2942,10 @@ create_cursor(ForeignScanState *node)
 	appendStringInfo(&buf, "DECLARE c%u CURSOR FOR\n%s",
 					 fsstate->cursor_number, fsstate->query);
 
+	/* Append LIMIT if numTuples were passed-down */
+	if (node->ss.ps.ps_numTuples >= 0)
+		appendStringInfo(&buf, " LIMIT %ld", node->ss.ps.ps_numTuples);
+
 	/*
 	 * Notice that we pass NULL for paramTypes, thus forcing the remote server
 	 * to infer types for all parameters.  Since we explicitly cast every
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 28c15ba..4533191 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2359,6 +2359,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	aggstate = makeNode(AggState);
 	aggstate->ss.ps.plan = (Plan *) node;
 	aggstate->ss.ps.state = estate;
+	aggstate->ss.ps.ps_numTuples = -1;
 
 	aggstate->aggs = NIL;
 	aggstate->numaggs = 0;
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a26bd63..ebf2da4 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -140,6 +140,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	 */
 	appendstate->ps.plan = (Plan *) node;
 	appendstate->ps.state = estate;
+	appendstate->ps.ps_numTuples = -1;
 	appendstate->appendplans = appendplanstates;
 	appendstate->as_nplans = nplans;
 
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index c39d790..0f4b608 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -63,6 +63,7 @@ ExecInitBitmapAnd(BitmapAnd *node, EState *estate, int eflags)
 	 */
 	bitmapandstate->ps.plan = (Plan *) node;
 	bitmapandstate->ps.state = estate;
+	bitmapandstate->ps.ps_numTuples = -1;
 	bitmapandstate->bitmapplans = bitmapplanstates;
 	bitmapandstate->nplans = nplans;
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 449aacb..497116f6 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -556,6 +556,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	scanstate = makeNode(BitmapHeapScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.ps_numTuples = -1;
 
 	scanstate->tbm = NULL;
 	scanstate->tbmiterator = NULL;
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index a364098..e04ab12 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -206,6 +206,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
 	indexstate = makeNode(BitmapIndexScanState);
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
+	indexstate->ss.ps.ps_numTuples = -1;
 
 	/* normally we don't make the result bitmap till runtime */
 	indexstate->biss_result = NULL;
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 7e928eb..30b0c2b 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -64,6 +64,7 @@ ExecInitBitmapOr(BitmapOr *node, EState *estate, int eflags)
 	 */
 	bitmaporstate->ps.plan = (Plan *) node;
 	bitmaporstate->ps.state = estate;
+	bitmaporstate->ps.ps_numTuples = -1;
 	bitmaporstate->bitmapplans = bitmapplanstates;
 	bitmaporstate->nplans = nplans;
 
diff --git a/src/backend/executor/nodeCtescan.c b/src/backend/executor/nodeCtescan.c
index 162650a..9cbec47 100644
--- a/src/backend/executor/nodeCtescan.c
+++ b/src/backend/executor/nodeCtescan.c
@@ -191,6 +191,7 @@ ExecInitCteScan(CteScan *node, EState *estate, int eflags)
 	scanstate = makeNode(CteScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.ps_numTuples = -1;
 	scanstate->eflags = eflags;
 	scanstate->cte_table = NULL;
 	scanstate->eof_cte = false;
diff --git a/src/backend/executor/nodeCustom.c b/src/backend/executor/nodeCustom.c
index 322abca..43e1910 100644
--- a/src/backend/executor/nodeCustom.c
+++ b/src/backend/executor/nodeCustom.c
@@ -44,6 +44,7 @@ ExecInitCustomScan(CustomScan *cscan, EState *estate, int eflags)
 	/* fill up fields of ScanState */
 	css->ss.ps.plan = &cscan->scan.plan;
 	css->ss.ps.state = estate;
+	css->ss.ps.ps_numTuples = -1;
 
 	/* create expression context for node */
 	ExecAssignExprContext(estate, &css->ss.ps);
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index d886aaf..6f38b79 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -144,6 +144,7 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	scanstate = makeNode(ForeignScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeFunctionscan.c b/src/backend/executor/nodeFunctionscan.c
index 5a0f324..68f7f9c 100644
--- a/src/backend/executor/nodeFunctionscan.c
+++ b/src/backend/executor/nodeFunctionscan.c
@@ -299,6 +299,7 @@ ExecInitFunctionScan(FunctionScan *node, EState *estate, int eflags)
 	scanstate = makeNode(FunctionScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.ps_numTuples = -1;
 	scanstate->eflags = eflags;
 
 	/*
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 880ca62..6a1c6cf 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -69,6 +69,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	gatherstate = makeNode(GatherState);
 	gatherstate->ps.plan = (Plan *) node;
 	gatherstate->ps.state = estate;
+	gatherstate->ps.ps_numTuples = -1;
 	gatherstate->need_to_scan_locally = !node->single_copy;
 
 	/*
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index dcf5175..b53f423 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -207,6 +207,7 @@ ExecInitGroup(Group *node, EState *estate, int eflags)
 	grpstate = makeNode(GroupState);
 	grpstate->ss.ps.plan = (Plan *) node;
 	grpstate->ss.ps.state = estate;
+	grpstate->ss.ps.ps_numTuples = -1;
 	grpstate->grp_done = FALSE;
 
 	/*
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 6375d9b..20d7dc8 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -172,6 +172,7 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	hashstate = makeNode(HashState);
 	hashstate->ps.plan = (Plan *) node;
 	hashstate->ps.state = estate;
+	hashstate->ps.ps_numTuples = -1;
 	hashstate->hashtable = NULL;
 	hashstate->hashkeys = NIL;	/* will be set by parent HashJoin */
 
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index 369e666..a13a83a 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -451,6 +451,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate = makeNode(HashJoinState);
 	hjstate->js.ps.plan = (Plan *) node;
 	hjstate->js.ps.state = estate;
+	hjstate->js.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 4f6f91c..756cc11 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -403,6 +403,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
 	indexstate = makeNode(IndexOnlyScanState);
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
+	indexstate->ss.ps.ps_numTuples = -1;
 	indexstate->ioss_HeapFetches = 0;
 
 	/*
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 3143bd9..c977ade 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -829,6 +829,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
 	indexstate = makeNode(IndexScanState);
 	indexstate->ss.ps.plan = (Plan *) node;
 	indexstate->ss.ps.state = estate;
+	indexstate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index faf32e1..04c8830 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -26,7 +26,6 @@
 #include "nodes/nodeFuncs.h"
 
 static void recompute_limits(LimitState *node);
-static void pass_down_bound(LimitState *node, PlanState *child_node);
 
 
 /* ----------------------------------------------------------------
@@ -232,6 +231,7 @@ static void
 recompute_limits(LimitState *node)
 {
 	ExprContext *econtext = node->ps.ps_ExprContext;
+	long		tuples_needed;
 	Datum		val;
 	bool		isNull;
 
@@ -296,70 +296,8 @@ recompute_limits(LimitState *node)
 	node->lstate = LIMIT_RESCAN;
 
 	/* Notify child node about limit, if useful */
-	pass_down_bound(node, outerPlanState(node));
-}
-
-/*
- * If we have a COUNT, and our input is a Sort node, notify it that it can
- * use bounded sort.  Also, if our input is a MergeAppend, we can apply the
- * same bound to any Sorts that are direct children of the MergeAppend,
- * since the MergeAppend surely need read no more than that many tuples from
- * any one input.  We also have to be prepared to look through a Result,
- * since the planner might stick one atop MergeAppend for projection purposes.
- *
- * This is a bit of a kluge, but we don't have any more-abstract way of
- * communicating between the two nodes; and it doesn't seem worth trying
- * to invent one without some more examples of special communication needs.
- *
- * Note: it is the responsibility of nodeSort.c to react properly to
- * changes of these parameters.  If we ever do redesign this, it'd be a
- * good idea to integrate this signaling with the parameter-change mechanism.
- */
-static void
-pass_down_bound(LimitState *node, PlanState *child_node)
-{
-	if (IsA(child_node, SortState))
-	{
-		SortState  *sortState = (SortState *) child_node;
-		int64		tuples_needed = node->count + node->offset;
-
-		/* negative test checks for overflow in sum */
-		if (node->noCount || tuples_needed < 0)
-		{
-			/* make sure flag gets reset if needed upon rescan */
-			sortState->bounded = false;
-		}
-		else
-		{
-			sortState->bounded = true;
-			sortState->bound = tuples_needed;
-		}
-	}
-	else if (IsA(child_node, MergeAppendState))
-	{
-		MergeAppendState *maState = (MergeAppendState *) child_node;
-		int			i;
-
-		for (i = 0; i < maState->ms_nplans; i++)
-			pass_down_bound(node, maState->mergeplans[i]);
-	}
-	else if (IsA(child_node, ResultState))
-	{
-		/*
-		 * An extra consideration here is that if the Result is projecting a
-		 * targetlist that contains any SRFs, we can't assume that every input
-		 * tuple generates an output tuple, so a Sort underneath might need to
-		 * return more than N tuples to satisfy LIMIT N. So we cannot use
-		 * bounded sort.
-		 *
-		 * If Result supported qual checking, we'd have to punt on seeing a
-		 * qual, too.  Note that having a resconstantqual is not a
-		 * showstopper: if that fails we're not getting any rows at all.
-		 */
-		if (outerPlanState(child_node) &&
-			!expression_returns_set((Node *) child_node->plan->targetlist))
-			pass_down_bound(node, outerPlanState(child_node));
-	}
+	tuples_needed = (node->noCount ? -1 : node->count + node->offset);
+	outerPlanState(node)->ps_numTuples = tuples_needed;
 }
 
 /* ----------------------------------------------------------------
@@ -384,6 +322,7 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	limitstate = makeNode(LimitState);
 	limitstate->ps.plan = (Plan *) node;
 	limitstate->ps.state = estate;
+	limitstate->ps.ps_numTuples = -1;
 
 	limitstate->lstate = LIMIT_INITIAL;
 
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index 4ebcaff..c5d430f 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -361,6 +361,7 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
 	lrstate = makeNode(LockRowsState);
 	lrstate->ps.plan = (Plan *) node;
 	lrstate->ps.state = estate;
+	lrstate->ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index 9ab03f3..e66f7aa 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -171,6 +171,7 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
 	matstate = makeNode(MaterialState);
 	matstate->ss.ps.plan = (Plan *) node;
 	matstate->ss.ps.state = estate;
+	matstate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * We must have a tuplestore buffering the subplan output to do backward
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index e271927..2a767c8 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -83,6 +83,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	 */
 	mergestate->ps.plan = (Plan *) node;
 	mergestate->ps.state = estate;
+	mergestate->ps.ps_numTuples = -1;
 	mergestate->mergeplans = mergeplanstates;
 	mergestate->ms_nplans = nplans;
 
@@ -174,10 +175,14 @@ ExecMergeAppend(MergeAppendState *node)
 		/*
 		 * First time through: pull the first tuple from each subplan, and set
 		 * up the heap.
+		 * Also, pass down the required number of tuples
 		 */
 		for (i = 0; i < node->ms_nplans; i++)
 		{
-			node->ms_slots[i] = ExecProcNode(node->mergeplans[i]);
+			PlanState  *pstate = node->mergeplans[i];
+
+			pstate->ps_numTuples = node->ps.ps_numTuples;
+			node->ms_slots[i] = ExecProcNode(pstate);
 			if (!TupIsNull(node->ms_slots[i]))
 				binaryheap_add_unordered(node->ms_heap, Int32GetDatum(i));
 		}
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 6db09b8..5d2be13 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1485,6 +1485,7 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	mergestate = makeNode(MergeJoinState);
 	mergestate->js.ps.plan = (Plan *) node;
 	mergestate->js.ps.state = estate;
+	mergestate->js.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index efb0c5e..a375310 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1576,6 +1576,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	mtstate->ps.plan = (Plan *) node;
 	mtstate->ps.state = estate;
 	mtstate->ps.targetlist = NIL;		/* not actually used */
+	mtstate->ps.ps_numTuples = -1;
 
 	mtstate->operation = operation;
 	mtstate->canSetTag = node->canSetTag;
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index 555fa09..d0850ae 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -309,6 +309,7 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	nlstate = makeNode(NestLoopState);
 	nlstate->js.ps.plan = (Plan *) node;
 	nlstate->js.ps.state = estate;
+	nlstate->js.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index acded07..9d115ec 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -169,6 +169,7 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	rustate = makeNode(RecursiveUnionState);
 	rustate->ps.plan = (Plan *) node;
 	rustate->ps.state = estate;
+	rustate->ps.ps_numTuples = -1;
 
 	rustate->eqfunctions = NULL;
 	rustate->hashfunctions = NULL;
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index 4007b76..87aee77 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -47,6 +47,7 @@
 
 #include "executor/executor.h"
 #include "executor/nodeResult.h"
+#include "nodes/nodeFuncs.h"
 #include "utils/memutils.h"
 
 
@@ -75,6 +76,28 @@ ExecResult(ResultState *node)
 	econtext = node->ps.ps_ExprContext;
 
 	/*
+	 * Pass down the number of required tuples by the upper node
+	 *
+	 * An extra consideration here is that if the Result is projecting a
+	 * targetlist that contains any SRFs, we can't assume that every input
+	 * tuple generates an output tuple, so a Sort underneath might need to
+	 * return more than N tuples to satisfy LIMIT N. So we cannot use
+	 * bounded sort.
+	 *
+	 * If Result supported qual checking, we'd have to punt on seeing a
+	 * qual, too.  Note that having a resconstantqual is not a
+	 * showstopper: if that fails we're not getting any rows at all.
+	 */
+	if (!node->rs_started)
+	{
+		if (outerPlanState(node) &&
+			!expression_returns_set((Node *) node->ps.plan->targetlist))
+			outerPlanState(node)->ps_numTuples = node->ps.ps_numTuples;
+
+		node->rs_started = true;
+	}
+
+	/*
 	 * check constant qualifications like (2 > 1), if not already done
 	 */
 	if (node->rs_checkqual)
@@ -217,7 +240,9 @@ ExecInitResult(Result *node, EState *estate, int eflags)
 	resstate = makeNode(ResultState);
 	resstate->ps.plan = (Plan *) node;
 	resstate->ps.state = estate;
+	resstate->ps.ps_numTuples = -1;
 
+	resstate->rs_started = false;
 	resstate->rs_done = false;
 	resstate->rs_checkqual = (node->resconstantqual == NULL) ? false : true;
 
@@ -294,6 +319,7 @@ ExecEndResult(ResultState *node)
 void
 ExecReScanResult(ResultState *node)
 {
+	node->rs_started = false;
 	node->rs_done = false;
 	node->ps.ps_TupFromTlist = false;
 	node->rs_checkqual = (node->resconstantqual == NULL) ? false : true;
diff --git a/src/backend/executor/nodeSamplescan.c b/src/backend/executor/nodeSamplescan.c
index 9ce7c02..978e122 100644
--- a/src/backend/executor/nodeSamplescan.c
+++ b/src/backend/executor/nodeSamplescan.c
@@ -152,6 +152,7 @@ ExecInitSampleScan(SampleScan *node, EState *estate, int eflags)
 	scanstate = makeNode(SampleScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 00bf3a5..f239b75 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -177,6 +177,7 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 	scanstate = makeNode(SeqScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index e94555e..8d83e71 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -479,6 +479,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 	setopstate = makeNode(SetOpState);
 	setopstate->ps.plan = (Plan *) node;
 	setopstate->ps.state = estate;
+	setopstate->ps.ps_numTuples = -1;
 
 	setopstate->eqfunctions = NULL;
 	setopstate->hashfunctions = NULL;
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index a34dcc5..0c80ccd 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -66,6 +66,16 @@ ExecSort(SortState *node)
 
 		SO1_printf("ExecSort: %s\n",
 				   "sorting subplan");
+		/*
+		 * Check bounds according to the required number of tuples
+		 */
+		if (node->ss.ps.ps_numTuples < 0)
+			node->bounded = false;
+		else
+		{
+			node->bounded = true;
+			node->bound = node->ss.ps.ps_numTuples;
+		}
 
 		/*
 		 * Want to scan subplan in the forward direction while creating the
@@ -162,6 +172,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	sortstate = makeNode(SortState);
 	sortstate->ss.ps.plan = (Plan *) node;
 	sortstate->ss.ps.state = estate;
+	sortstate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * We must have random access to the sort output to do backward scan or
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 9bafc62..be9f532 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -109,6 +109,7 @@ ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags)
 	subquerystate = makeNode(SubqueryScanState);
 	subquerystate->ss.ps.plan = (Plan *) node;
 	subquerystate->ss.ps.state = estate;
+	subquerystate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeTidscan.c b/src/backend/executor/nodeTidscan.c
index d54fe36..c98a31c 100644
--- a/src/backend/executor/nodeTidscan.c
+++ b/src/backend/executor/nodeTidscan.c
@@ -461,6 +461,7 @@ ExecInitTidScan(TidScan *node, EState *estate, int eflags)
 	tidstate = makeNode(TidScanState);
 	tidstate->ss.ps.plan = (Plan *) node;
 	tidstate->ss.ps.state = estate;
+	tidstate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index f45c792..460f18b 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -122,6 +122,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	uniquestate = makeNode(UniqueState);
 	uniquestate->ps.plan = (Plan *) node;
 	uniquestate->ps.state = estate;
+	uniquestate->ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeValuesscan.c b/src/backend/executor/nodeValuesscan.c
index 9c03f8a..4ea2d90 100644
--- a/src/backend/executor/nodeValuesscan.c
+++ b/src/backend/executor/nodeValuesscan.c
@@ -219,6 +219,7 @@ ExecInitValuesScan(ValuesScan *node, EState *estate, int eflags)
 	scanstate = makeNode(ValuesScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 06f8aa0..91491de 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1820,6 +1820,7 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	winstate = makeNode(WindowAggState);
 	winstate->ss.ps.plan = (Plan *) node;
 	winstate->ss.ps.state = estate;
+	winstate->ss.ps.ps_numTuples = -1;
 
 	/*
 	 * Create expression contexts.  We need two, one for per-input-tuple
diff --git a/src/backend/executor/nodeWorktablescan.c b/src/backend/executor/nodeWorktablescan.c
index cfed6e6..6bd01e8 100644
--- a/src/backend/executor/nodeWorktablescan.c
+++ b/src/backend/executor/nodeWorktablescan.c
@@ -144,6 +144,7 @@ ExecInitWorkTableScan(WorkTableScan *node, EState *estate, int eflags)
 	scanstate = makeNode(WorkTableScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.ps_numTuples = -1;
 	scanstate->rustate = NULL;	/* we'll set this later */
 
 	/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f6f73f3..f641b93 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1066,6 +1066,8 @@ typedef struct PlanState
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
 	bool		ps_TupFromTlist;/* state flag for processing set-valued
 								 * functions in targetlist */
+	long		ps_numTuples;	/* number of tuples required by the upper
+								 * node if any. -1 means all tuples */
 } PlanState;
 
 /* ----------------
@@ -1114,6 +1116,7 @@ typedef struct ResultState
 {
 	PlanState	ps;				/* its first field is NodeTag */
 	ExprState  *resconstantqual;
+	bool		rs_started;		/* are we already called? */
 	bool		rs_done;		/* are we done? */
 	bool		rs_checkqual;	/* do we need to check the qual? */
 } ResultState;
#2Jeevan Chalke
jeevan.chalke@enterprisedb.com
In reply to: Kouhei Kaigai (#1)
Re: PassDownLimitBound for ForeignScan/CustomScan [take-2]

Hi,

I have reviewed the patch.

Patch applies cleanly. make/"make install"/"make check" all are fine.

I too see a good performance in execution time when LIMIT is pushed with
cursor query in postgres_fdw at execution time

However here are few comments on the changes:

1. ps_numTuples is declared as long, however offset and count members in
LimitState struct and bound member in SortState struct is int64. However
long on 32 bit machine may be 32 bits and thus I think tuples_needed which
is long may have overflow hazards as it may store int64 + int64. I think
ps_numTuples should be int64.

2. Robert suggested following in the previous discussion:
"For example, suppose we add a new PlanState member "long
numTuples" where 0 means that the number of tuples that will be needed
is unknown (so that most node types need not initialize it), a
positive value is an upper bound on the number of tuples that will be
fetched, and -1 means that it is known for certain that we will need
all of the tuples."

We should have 0 for the default case so that we don't need to initialize it
at most of the places. But I see many such changes in the patch. I think
this is not possible here since 0 can be a legal user provided value which
cannot be set as a default (default is all rows).

However do you think, can we avoid that? Is there any other way so that we
don't need every node having ps_numTuples to be set explicitly?

Apart from this patch look good to me.

Regards,
--
Jeevan Chalke
Principal Software Engineer, Product Development
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#3Robert Haas
robertmhaas@gmail.com
In reply to: Kouhei Kaigai (#1)
Re: PassDownLimitBound for ForeignScan/CustomScan [take-2]

On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

As an example, I enhanced postgres_fdw to understand the ps_numTuples
if it is set. If and when remote ORDER BY is pushed down, the latest
code tries to sort the entire remote table because it does not know
how many rows to be returned. Thus, it took larger execution time.
On the other hands, the patched one runs the remote query with LIMIT
clause according to the ps_numTuples; which is informed by the Limit
node on top of the ForeignScan node.

So there are two cases here. If the user says LIMIT 12, we could in
theory know that at planner time and optimize accordingly. If the
user says LIMIT twelve(), however, we will need to wait until
execution time unless twelve() happens to be capable of being
simplified to a constant by the planner.

Therefore, it's possible to imagine having two mechanisms here. In the
simple case where the LIMIT and OFFSET values are constants, we could
implement a system to get hold of that information during planning and
use it for whatever we like. In addition, we can have an
execution-time system that optimizes based on values available at
execution (regardless of whether those values were also available
during planning). Those are, basically, two separate things, and this
patch has enough to do just focusing on one of them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Robert Haas
robertmhaas@gmail.com
In reply to: Jeevan Chalke (#2)
Re: PassDownLimitBound for ForeignScan/CustomScan [take-2]

On Tue, Nov 8, 2016 at 6:54 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:

1. ps_numTuples is declared as long, however offset and count members in
LimitState struct and bound member in SortState struct is int64. However
long on 32 bit machine may be 32 bits and thus I think tuples_needed which
is long may have overflow hazards as it may store int64 + int64. I think
ps_numTuples should be int64.

I suggested long originally because that's what ExecutorRun() was
using at the time. It seems that it got changed to uint64 in
23a27b039d94ba359286694831eafe03cd970eef, so I guess we should
probably use uint64.

2. Robert suggested following in the previous discussion:
"For example, suppose we add a new PlanState member "long
numTuples" where 0 means that the number of tuples that will be needed
is unknown (so that most node types need not initialize it), a
positive value is an upper bound on the number of tuples that will be
fetched, and -1 means that it is known for certain that we will need
all of the tuples."

We should have 0 for the default case so that we don't need to initialize it
at most of the places. But I see many such changes in the patch. I think
this is not possible here since 0 can be a legal user provided value which
cannot be set as a default (default is all rows).

However do you think, can we avoid that? Is there any other way so that we
don't need every node having ps_numTuples to be set explicitly?

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Robert Haas (#3)
Re: PassDownLimitBound for ForeignScan/CustomScan [take-2]

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
Sent: Thursday, November 10, 2016 3:08 AM
To: Kaigai Kouhei(海外 浩平)
Cc: pgsql-hackers@postgresql.org; Jeevan Chalke; Etsuro Fujita; Andres Freund
Subject: Re: [HACKERS] PassDownLimitBound for ForeignScan/CustomScan [take-2]

On Mon, Oct 31, 2016 at 10:20 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

As an example, I enhanced postgres_fdw to understand the ps_numTuples
if it is set. If and when remote ORDER BY is pushed down, the latest
code tries to sort the entire remote table because it does not know
how many rows to be returned. Thus, it took larger execution time.
On the other hands, the patched one runs the remote query with LIMIT
clause according to the ps_numTuples; which is informed by the Limit
node on top of the ForeignScan node.

So there are two cases here. If the user says LIMIT 12, we could in
theory know that at planner time and optimize accordingly. If the
user says LIMIT twelve(), however, we will need to wait until
execution time unless twelve() happens to be capable of being
simplified to a constant by the planner.

Therefore, it's possible to imagine having two mechanisms here. In the
simple case where the LIMIT and OFFSET values are constants, we could
implement a system to get hold of that information during planning and
use it for whatever we like. In addition, we can have an
execution-time system that optimizes based on values available at
execution (regardless of whether those values were also available
during planning). Those are, basically, two separate things, and this
patch has enough to do just focusing on one of them.

OK, we need to have a private value of postgres_fdw to indicate whether
LIMIT and OFFSET were supplied at the planner stage. If any, it has to
be matched with the ps_numTuples informed at the executor stage.

I'll revise the patch.
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Kouhei Kaigai (#5)
Re: PassDownLimitBound for ForeignScan/CustomScan [take-2]

On Tue, Nov 8, 2016 at 6:54 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:

1. ps_numTuples is declared as long, however offset and count members in
LimitState struct and bound member in SortState struct is int64. However
long on 32 bit machine may be 32 bits and thus I think tuples_needed which
is long may have overflow hazards as it may store int64 + int64. I think
ps_numTuples should be int64.

I suggested long originally because that's what ExecutorRun() was
using at the time. It seems that it got changed to uint64 in
23a27b039d94ba359286694831eafe03cd970eef, so I guess we should
probably use uint64.

2. Robert suggested following in the previous discussion:
"For example, suppose we add a new PlanState member "long
numTuples" where 0 means that the number of tuples that will be needed
is unknown (so that most node types need not initialize it), a
positive value is an upper bound on the number of tuples that will be
fetched, and -1 means that it is known for certain that we will need
all of the tuples."

We should have 0 for the default case so that we don't need to initialize it
at most of the places. But I see many such changes in the patch. I think
this is not possible here since 0 can be a legal user provided value which
cannot be set as a default (default is all rows).

However do you think, can we avoid that? Is there any other way so that we
don't need every node having ps_numTuples to be set explicitly?

+1.

I thought we have to distinguish a case if LIMIT 0 is supplied.
However, in this case, ExecLimit() never goes down to the child nodes,
thus, its ps_numTuples shall not be referenced anywhere.

OK, I'll use uint64 for ps_numTuples, and 0 shall be a usual default
value that means no specific number of rows are given.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Haribabu Kommi
kommi.haribabu@gmail.com
In reply to: Kouhei Kaigai (#6)
Re: PassDownLimitBound for ForeignScan/CustomScan [take-2]

On Thu, Nov 10, 2016 at 10:59 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com>
wrote:

On Tue, Nov 8, 2016 at 6:54 AM, Jeevan Chalke
<jeevan.chalke@enterprisedb.com> wrote:

1. ps_numTuples is declared as long, however offset and count members

in

LimitState struct and bound member in SortState struct is int64.

However

long on 32 bit machine may be 32 bits and thus I think tuples_needed

which

is long may have overflow hazards as it may store int64 + int64. I

think

ps_numTuples should be int64.

I suggested long originally because that's what ExecutorRun() was
using at the time. It seems that it got changed to uint64 in
23a27b039d94ba359286694831eafe03cd970eef, so I guess we should
probably use uint64.

2. Robert suggested following in the previous discussion:
"For example, suppose we add a new PlanState member "long
numTuples" where 0 means that the number of tuples that will be needed
is unknown (so that most node types need not initialize it), a
positive value is an upper bound on the number of tuples that will be
fetched, and -1 means that it is known for certain that we will need
all of the tuples."

We should have 0 for the default case so that we don't need to

initialize it

at most of the places. But I see many such changes in the patch. I

think

this is not possible here since 0 can be a legal user provided value

which

cannot be set as a default (default is all rows).

However do you think, can we avoid that? Is there any other way so

that we

don't need every node having ps_numTuples to be set explicitly?

+1.

I thought we have to distinguish a case if LIMIT 0 is supplied.
However, in this case, ExecLimit() never goes down to the child nodes,
thus, its ps_numTuples shall not be referenced anywhere.

OK, I'll use uint64 for ps_numTuples, and 0 shall be a usual default
value that means no specific number of rows are given.

Marked as "returned with feedback" in 2016-11 commitfest.
Please feel free to update the status when you submit the latest patch.

Regards,
Hari Babu
Fujitsu Australia