[GSoC] Push-based query executor discussion

Started by Arseny Sher · almost 9 years ago · 13 messages
#1Arseny Sher
sher-ars@yandex.ru

Hello,

I would like to work on a push-based executor [1] during GSoC, so I'm
writing to introduce myself and start the discussion of the project. I
think I should mention beforehand that the subject is my master's
thesis topic, and I have already started working on it. This letter is
obviously not a ready proposal but rather an initial point for talking
over the concept. Below you will find a short review of the idea, a
description of its benefits for the community, project details, related
work and some info about me.

*Brief review*
The idea is described on the wiki page [1] and in the letter [2]. I
propose to replace the current ExecProcNode interface between execution
nodes with a function called, say, pushTuple, which pushes a ready
tuple to the current node's parent.

*Benefits for the community*
Why would we want this? In general, because the Postgres executor is
slow for CPU-bound queries, and this approach should accelerate it.
Papers [4] and [5] show that the push-based model is faster, and that
JIT compilation makes the difference even more drastic.

Besides, while working on this, I will try to investigate the Postgres
executor's performance in both models extensively in order to study
the effects of the model change. For instance, it is commonly accepted
that the current Volcano-style model leads to poor use of modern CPUs'
pipelining abilities and to a large percentage of branch
mispredictions. I am going to see whether, where and when this is true
in Postgres; the profiling results should be useful for any further
executor optimizations.

*Project details*
Technically, I am planning to implement this as follows. The code
common to all nodes that needs to be changed is in execMain.c and
execProcnode.c; standard_ExecutorRun in execMain.c should now start
execution of all leaf nodes in the proper order instead of pulling
tuples one-by-one from the top-level node. By 'proper' order here I
mean that inner nodes will be run first and outer nodes second, so
that by the time the first tuple from the outer side of some node
arrives, the node has already received all its tuples from the inner
side.
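This start order can be sketched as a small tree walk that visits every
node's inner subtree before its outer subtree. The Node struct and
collect_leaves function below are illustrative stand-ins for this
sketch, not the actual executor structures:

```c
#include <stddef.h>

/* Toy plan node: in the real executor this would be a PlanState with
 * lefttree (outer) and righttree (inner) children. */
typedef struct Node
{
    struct Node *outer;         /* lefttree */
    struct Node *inner;         /* righttree */
    int          id;
} Node;

/* Collect leaves so that every leaf in a node's inner subtree comes
 * before any leaf in its outer subtree.  Running leaves in this order
 * guarantees e.g. that a hash join has received its whole inner side
 * (the hash table is built) before the first outer tuple arrives. */
static int
collect_leaves(Node *node, int *out, int n)
{
    if (node == NULL)
        return n;
    if (node->outer == NULL && node->inner == NULL)
    {
        out[n++] = node->id;    /* a leaf, e.g. a SeqScan */
        return n;
    }
    n = collect_leaves(node->inner, out, n);    /* inner side first */
    n = collect_leaves(node->outer, out, n);
    return n;
}
```

For a HashJoin whose inner side is a Hash over one scan and whose outer
side is another scan, this yields the inner-side scan first, so the
hash table is complete before probing starts.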

How do we 'start' execution of a leaf? Recall that instead of
ExecProcNode we now have the pushTuple function with the following
signature:

bool pushTuple(TupleTableSlot *slot, PlanState *node, PlanState *pusher)

'slot' is the tuple we push. 'node' is the receiver of the tuple, and
'pusher' is the sender of the tuple; 'node' is the pusher's parent. We
need 'pusher' only to distinguish inner and outer pushes. This function
returns true if 'node' is still accepting tuples after the push, and
false if not; e.g. a Limit node can return false after the required
number of tuples has been passed. We also add the convention that when
a node has nothing more to push, it calls pushTuple with slot=NULL to
let the parent know that it is done. So, to start execution of a leaf,
standard_ExecutorRun basically needs to call pushTuple(NULL, leaf,
NULL) once. Leaf nodes are a special case because pusher=NULL; another
obvious special case is the top-level node: it calls pushTuple(slot,
NULL, node), and such a call pushes the slot to the destination
((*dest->receiveSlot) (slot, dest) in the current code).

Like ExecProcNode, pushTuple will call the proper implementation, e.g.
pushTupleToLimit. Such implementations will contain code similar to
their pull-based analogues (e.g. ExecLimit), but, very roughly, where
we have

return slot;

in the push model we will now have

bool parent_accepts_tuples = pushTuple(slot, node->parent, node);

and then we will continue execution if parent_accepts_tuples is true
or exit if not.
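To make the whole protocol concrete, here is a minimal self-contained
toy version of it in plain C. Nothing below is actual executor code:
Tuple stands in for TupleTableSlot, PlanState is reduced to a tag plus
a parent link, and Limit is the only implemented node. It demonstrates
the dispatcher, the slot=NULL end-of-stream convention, the node ==
NULL top-level case that feeds the destination, and the boolean return
value that lets Limit stop its child early:

```c
#include <stdbool.h>
#include <stddef.h>

typedef int Tuple;              /* pretend a slot is just an int */

typedef enum { T_Limit } NodeTag;

typedef struct PlanState
{
    NodeTag           tag;
    struct PlanState *parent;
    int               limit_left;   /* Limit-specific state */
} PlanState;

/* stand-in for the DestReceiver */
static Tuple dest_buf[16];
static int   dest_ntuples = 0;

static bool pushTupleToLimit(Tuple *slot, PlanState *node, PlanState *pusher);

/* Dispatcher, analogous to ExecProcNode: node == NULL means the pusher
 * is the top-level node, so the tuple goes to the destination. */
static bool
pushTuple(Tuple *slot, PlanState *node, PlanState *pusher)
{
    if (node == NULL)
    {
        if (slot != NULL)
            dest_buf[dest_ntuples++] = *slot;   /* (*dest->receiveSlot)() */
        return true;
    }
    switch (node->tag)
    {
        case T_Limit:
            return pushTupleToLimit(slot, node, pusher);
    }
    return false;
}

static bool
pushTupleToLimit(Tuple *slot, PlanState *node, PlanState *pusher)
{
    (void) pusher;              /* Limit has only one (outer) child */

    if (slot == NULL)           /* child is done: propagate EOF */
        return pushTuple(NULL, node->parent, node);
    if (node->limit_left == 0)
        return false;           /* stop the child early */
    node->limit_left--;
    pushTuple(slot, node->parent, node);
    return node->limit_left > 0;
}

/* A leaf "scan" over nvals integers: it keeps pushing while the parent
 * still accepts tuples, then signals EOF with slot = NULL.  (A real
 * leaf would pass its own PlanState as pusher, not NULL.) */
static void
run_leaf_scan(PlanState *parent, const Tuple *vals, int nvals)
{
    for (int i = 0; i < nvals; i++)
    {
        Tuple t = vals[i];
        if (!pushTuple(&t, parent, NULL))
            break;
    }
    pushTuple(NULL, parent, NULL);
}
```

Running a leaf scan over five tuples through a Limit of 3 delivers
exactly three tuples to the destination and stops the scan early, which
is precisely what the boolean return value exists for.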

Complex nodes require more involved modifications to preserve correct
behaviour and remain efficient. The latter leads to some architectural
issues: for example, an efficient SeqScan should call pushTuple from a
function doing something similar to what heapgettups_pagemode
currently does; otherwise, we would need to save and restore its state
(lines, page, etc.) for every single tuple. On the other hand, it is
not nice to call pushTuple from there, because the access level
(heapam.c) currently knows nothing about PlanStates. Such issues will
need to be addressed and discussed with the community.

Currently, I have a prototype (pretty much WIP) which implements this
model for the SeqScan, Limit, Hash and HashJoin nodes.

Since the TPC-H benchmarks are the de facto standard for evaluating
such things, I am planning to use them for testing. BTW, I've written
a couple of scripts to automate this job [16], although it seems that
everyone who tests TPC-H ends up writing their own version.

Now, it is clear that rewriting all nodes with full support in such a
manner is a huge amount of work. Besides, we still don't know the
quantitative benefit of this model. Because of that, I do not propose
any timeline right now; instead, we should decide which part of this
work (if any) is going to be done in the course of GSoC. Probably, all
TPC-H queries with and without index support is a good initial target,
but this needs to be discussed. Anyway, I don't think that the result
will be a patch ready to be merged into Postgres master straight away,
because it is a rather radical change; and supporting both executors
simultaneously also seems a bad idea, because much code would be
duplicated in this case.

*Related work*
There are other works aimed at improving executor performance. I can
mention at least four approaches:
* JITing things [6][10][17]
* batched and/or vectorized execution [7][8][9]
* expression evaluation optimizations [10][17]
* slot deforming optimizations [10]

The latter two are orthogonal to the proposed project.
Batched execution and JIT can be applied together, and one study [10]
shows the benefits of such a combined approach.

While batched execution and the push-based execution model can be
applied together too, they seemingly solve more or less the same
problems -- code and data locality, avoiding reloading a node's state,
and better use of modern CPU features. However, batched execution
requires massive changes to the current code and seems harder to
implement; IIRC I have seen some patches on this in the mailing lists,
but I am not aware which work is the most complete as of now. It is
not easy to compare these approaches theoretically; probably, the best
way to estimate their effect is to implement them and run benchmarks.

The relationship between JIT compilation and the push-based execution
model is interesting: paper [5] shows that the HyPer system with
JIT + push runs 4x faster on some queries than with JIT + pull. It
should be noted, though, that HyPer uses column-wise storage.
The full query compiler developed at ISP RAS [6] speeds up query
processing 2-5 times on TPC-H queries and exploits the push-based
execution model. However, supporting it requires implementing the
executor logic twice: in plain C for the usual interpreter and via the
LLVM API for the JIT compiler. Ideally we would like to write the code
once and be able to use it both with and without JIT compilation.
There is ongoing work at ISP RAS to make this possible using automatic
runtime code specialization; however, experiments have shown that
specialization of the original Postgres C code doesn't give a
significant improvement because of the Volcano-style model. We expect
that after switching the Postgres code to the push-based model, we
will achieve a speedup comparable to full-query JIT by using runtime
specialization.

*About me*
My name is Arseny Sher. Currently, I am studying for a master's degree
at CMC MSU [12] and working at ISP RAS [13]. Earlier I got a
bachelor's degree at the same faculty. I started working with Postgres
at the end of October, and I love its extensibility and the excellent
quality of its code. My previous work was mainly connected with
computations on big graphs (keywords: Spark, GraphX, Scala, GraphLab).
I also did some scala.js coding for Russian philologists and
participated in a project for IMDG performance comparison, doing
mostly devops stuff (Docker, Ansible, Python). Here are my
stackoverflow [14] and github [15] accounts.

The idea of this project was born when my colleagues working on JITing
Postgres realized that runtime specialization of C code written in a
push-based architecture should be much more efficient than
specializing the existing code (see the 'Related work' section), and
around that time I decided that I wanted my thesis to be connected
with PostgreSQL.

I am ready to work full-time this summer, and I think that push-based
execution of all TPC-H queries is quite an achievable goal; however, I
haven't yet studied all the required nodes in detail, and I will make
more exact estimates if the community supports this project.

P.S. Should letters like this go to the hackers or the students
mailing list? The latter seems more suitable, but it looks rather
dead...

____________________________________________________________
References

[1]: https://wiki.postgresql.org/wiki/GSoC_2017#Implementing_push-based_query_executor
[2]: /messages/by-id/CAFRJ5K3sDGSn=pKgnsobYQX4CMTOU=0uJ-vt2kF3t1FsVnTCRQ@mail.gmail.com
[4]: Efficiently Compiling Efficient Query Plans for Modern Hardware, http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
[5]: Compiling Database Queries into Machine Code, http://sites.computer.org/debull/A14mar/p3.pdf
[6]: LLVM Cauldron slides, http://llvm.org/devmtg/2016-09/slides/Melnik-PostgreSQLLLVM.pdf
[7]: /messages/by-id/CA+Tgmobx8su_bYtAa3DgrqB+R7xZG6kHRj0ccMUUshKAQVftww@mail.gmail.com
[8]: /messages/by-id/20160624232953.beub22r6yqux4gcp@alap3.anarazel.de
[9]: /messages/by-id/50877c0c-fb88-b601-3115-55a8c70d693e@postgrespro.ru
[10]: /messages/by-id/20161206034955.bh33paeralxbtluv@alap3.anarazel.de
[11]: Vectorization vs. Compilation in Query Execution, https://pdfs.semanticscholar.org/dcee/b1e11d3b078b0157325872a581b51402ff66.pdf
[12]: https://cs.msu.ru/en
[13]: http://www.ispras.ru/en/
[14]: http://stackoverflow.com/users/4014587/ars
[15]: https://github.com/arssher
[16]: https://github.com/arssher/pgtpch
[17]: /messages/by-id/CADviLuNjQTh99o6E0LTi0Ygks=naW8SXHmgn=8P+aaBXKXa0pA@mail.gmail.com

--
Arseny Sher

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Robert Haas
robertmhaas@gmail.com
In reply to: Arseny Sher (#1)
Re: [HACKERS] [GSoC] Push-based query executor discussion

On Mon, Mar 6, 2017 at 11:20 AM, Arseny Sher <sher-ars@yandex.ru> wrote:

I would like to work on push-based executor [1] during GSoC, so I'm
writing to introduce myself and start the discussion of the project. I
think I should mention beforehand that the subject is my master's
thesis topic, and I have already started working on it. This letter is
not (obviously) a ready proposal but rather initial point to talk over
the concept. Below you can see a short review of the idea, description
of benefits for the community, details, related work and some info
about me.

While I admire your fearlessness, I think the chances of you being
able to bring a project of this type to a successful conclusion are
remote. Here is what I said about this topic previously:

/messages/by-id/CA+Tgmoa=kzHJ+TwxyQ+vKu21nk3prkRjSdbhjubN7qvc8UKuGg@mail.gmail.com

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-students mailing list (pgsql-students@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-students

#3Arseny Sher
sher-ars@ispras.ru
In reply to: Robert Haas (#2)
8 attachment(s)
Re: [GSoC] Push-based query executor discussion

I will share the actual benchmarks and the code to give it another
chance and to give an idea of how the result looks. Currently I have
implemented the suggested changes to the SeqScan, Hash, HashJoin,
Limit, hashed aggregation and in-memory sort nodes. This allows
testing queries q1, q3, q4, q5, q10, q12 and q14 from the TPC-H set.
Since my goal was just to estimate the performance benefits, there are
several restrictions:
* ExecReScan is not supported
* only CMD_SELECT operations currently work
* only the forward direction is supported.
SRFs, subplans and parallel execution are not supported either,
because the corresponding nodes are not yet implemented.

Here you can see the results:

+-----+-----------+---------+----------+
|query|reversed, s|master, s|speedup, %|
+-----+-----------+---------+----------+
|q01  |128.53     |138.94   |8.1       |
+-----+-----------+---------+----------+
|q03  |61.53      |67.29    |9.36      |
+-----+-----------+---------+----------+
|q04  |86.27      |95.95    |11.22     |
+-----+-----------+---------+----------+
|q05  |54.44      |56.82    |4.37      |
+-----+-----------+---------+----------+
|q10  |55.44      |59.88    |8.01      |
+-----+-----------+---------+----------+
|q12  |69.59      |76.65    |10.15     |
+-----+-----------+---------+----------+

'reversed' is Postgres with the push-based executor; 'master' is the
current master branch. 24 runs were conducted and the median of them
was taken. Speedup in % is (master - reversed) / reversed * 100. The
scale factor of the TPC-H database was 40. We use doubles as floating
point types instead of numerics. Only q1 here is fully supported,
meaning that the planner would choose this plan anyway, even if all
the other nodes were implemented. For the other queries the planner
also uses Index Scan, Nested Loop Semi Join, bitmap scans and
Materialize, which are not yet reversed.
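As a sanity check on the table, the speedup column can be reproduced
from the two timing columns with the formula quoted above:

```c
/* Speedup in percent, as defined in the text:
 * (master - reversed) / reversed * 100 */
static double
speedup_pct(double reversed_s, double master_s)
{
    return (master_s - reversed_s) / reversed_s * 100.0;
}
```

For example, q01 gives (138.94 - 128.53) / 128.53 * 100, which is
about 8.1, matching the table.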

postgresql.conf was

shared_buffers = 128GB
maintenance_work_mem = 1GB
work_mem = 8GB
effective_cache_size = 128GB

max_wal_senders = 0
max_parallel_workers_per_gather = 0 # disable parallelism

# disable not yet implemented nodes
set enable_indexscan TO off;
set enable_indexonlyscan TO off;
set enable_material TO off;
set enable_bitmapscan TO off;
set enable_nestloop TO off;
set enable_sort TO off;

Patches are attached, they apply cleanly on 767ce36ff36747.

While in some places the patches introduce a kind of ugliness, which
is described in the commit messages (e.g. heapam.h now must know about
PlanState *), I think in other places this approach can make the
architecture a bit cleaner. Specifically, we can now cleanly separate
the logic for handling tuples from the inner and outer sides (see
hashjoin), and also separate the logic for handling NULL tuples. I
haven't added the latter yet, but the idea is that the node below
always knows when it is done, so it can call its parent's function for
handling NULL tuples directly, instead of keeping an extra 'if' in the
generic execProcNode/pushTuple.

--
Arseny Sher

Attachments:

0001-parent-param-added-to-ExecInitNode-parent-field-adde.patchtext/x-diffDownload
From e2bfb13eab7e06dd6691ccdfba54166a7bf3ba8c Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Fri, 10 Mar 2017 15:02:37 +0300
Subject: [PATCH 1/8] parent param added to ExecInitNode, parent field added to
 PlanState

---
 src/backend/executor/execMain.c           | 8 ++++----
 src/backend/executor/execProcnode.c       | 3 ++-
 src/backend/executor/nodeAgg.c            | 2 +-
 src/backend/executor/nodeAppend.c         | 2 +-
 src/backend/executor/nodeBitmapAnd.c      | 2 +-
 src/backend/executor/nodeBitmapHeapscan.c | 2 +-
 src/backend/executor/nodeBitmapOr.c       | 2 +-
 src/backend/executor/nodeForeignscan.c    | 2 +-
 src/backend/executor/nodeGather.c         | 2 +-
 src/backend/executor/nodeGatherMerge.c    | 2 +-
 src/backend/executor/nodeGroup.c          | 2 +-
 src/backend/executor/nodeHash.c           | 3 ++-
 src/backend/executor/nodeHashjoin.c       | 6 ++++--
 src/backend/executor/nodeLimit.c          | 2 +-
 src/backend/executor/nodeLockRows.c       | 2 +-
 src/backend/executor/nodeMaterial.c       | 2 +-
 src/backend/executor/nodeMergeAppend.c    | 2 +-
 src/backend/executor/nodeMergejoin.c      | 5 +++--
 src/backend/executor/nodeModifyTable.c    | 2 +-
 src/backend/executor/nodeNestloop.c       | 4 ++--
 src/backend/executor/nodeProjectSet.c     | 2 +-
 src/backend/executor/nodeRecursiveunion.c | 4 ++--
 src/backend/executor/nodeResult.c         | 2 +-
 src/backend/executor/nodeSetOp.c          | 2 +-
 src/backend/executor/nodeSort.c           | 2 +-
 src/backend/executor/nodeSubqueryscan.c   | 2 +-
 src/backend/executor/nodeUnique.c         | 2 +-
 src/backend/executor/nodeWindowAgg.c      | 2 +-
 src/include/executor/executor.h           | 3 ++-
 src/include/nodes/execnodes.h             | 1 +
 30 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index f5cd65d8a0..efb3f30dd0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -955,7 +955,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 		if (bms_is_member(i, plannedstmt->rewindPlanIDs))
 			sp_eflags |= EXEC_FLAG_REWIND;
 
-		subplanstate = ExecInitNode(subplan, estate, sp_eflags);
+		subplanstate = ExecInitNode(subplan, estate, sp_eflags, NULL);
 
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
@@ -968,7 +968,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	 * tree.  This opens files, allocates storage and leaves us ready to start
 	 * processing tuples.
 	 */
-	planstate = ExecInitNode(plan, estate, eflags);
+	planstate = ExecInitNode(plan, estate, eflags, NULL);
 
 	/*
 	 * Get the tuple descriptor describing the type of tuples to return.
@@ -3006,7 +3006,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 		Plan	   *subplan = (Plan *) lfirst(l);
 		PlanState  *subplanstate;
 
-		subplanstate = ExecInitNode(subplan, estate, 0);
+		subplanstate = ExecInitNode(subplan, estate, 0, NULL);
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
 	}
@@ -3016,7 +3016,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * of the plan tree we need to run.  This opens files, allocates storage
 	 * and leaves us ready to start processing tuples.
 	 */
-	epqstate->planstate = ExecInitNode(planTree, estate, 0);
+	epqstate->planstate = ExecInitNode(planTree, estate, 0, NULL);
 
 	MemoryContextSwitchTo(oldcontext);
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 80c77addb8..c1c4cecd6c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -131,12 +131,13 @@
  *		  'node' is the current node of the plan produced by the query planner
  *		  'estate' is the shared execution state for the plan tree
  *		  'eflags' is a bitwise OR of flag bits described in executor.h
+ *        'parent' is parent of the node
  *
  *		Returns a PlanState node corresponding to the given Plan node.
  * ------------------------------------------------------------------------
  */
 PlanState *
-ExecInitNode(Plan *node, EState *estate, int eflags)
+ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 {
 	PlanState  *result;
 	List	   *subps;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3207ee460c..fa19358d19 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2523,7 +2523,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	if (node->aggstrategy == AGG_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
 	outerPlan = outerPlan(node);
-	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * initialize source tuple type.
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6986caee6b..752b22d219 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -165,7 +165,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		appendplanstates[i] = ExecInitNode(initNode, estate, eflags, NULL);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index e4eb028ff9..c2a2f7d30a 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -81,7 +81,7 @@ ExecInitBitmapAnd(BitmapAnd *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags, NULL);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2e9ff7d1b9..c0bcfb5d98 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -903,7 +903,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	 * relation's indexes, and we want to be sure we have acquired a lock on
 	 * the relation first.
 	 */
-	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * all done.
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index c0f261407b..c834e29abb 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -82,7 +82,7 @@ ExecInitBitmapOr(BitmapOr *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags, NULL);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 3b6d1390eb..2e6ceb8b34 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -222,7 +222,7 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	/* Initialize any outer plan. */
 	if (outerPlan(node))
 		outerPlanState(scanstate) =
-			ExecInitNode(outerPlan(node), estate, eflags);
+			ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * Tell the FDW to initialize the scan.
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 32c97d390e..0031898acf 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -98,7 +98,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	 * now initialize outer plan
 	 */
 	outerNode = outerPlan(node);
-	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
+	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags, NULL);
 
 	/*
 	 * Initialize result tuple type and projection info.
diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 72f30ab4e6..7ed0c5bc0c 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -102,7 +102,7 @@ ExecInitGatherMerge(GatherMerge *node, EState *estate, int eflags)
 	 * now initialize outer plan
 	 */
 	outerNode = outerPlan(node);
-	outerPlanState(gm_state) = ExecInitNode(outerNode, estate, eflags);
+	outerPlanState(gm_state) = ExecInitNode(outerNode, estate, eflags, NULL);
 
 	/*
 	 * Initialize result tuple type and projection info.
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index 66c095bc72..5338e29187 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -198,7 +198,7 @@ ExecInitGroup(Group *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(grpstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(grpstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * initialize tuple type.
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index e695d8834b..43e65ca04e 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -200,7 +200,8 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags,
+											 (PlanState*) hashstate);
 
 	/*
 	 * initialize tuple type. no need to initialize projection info because
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index c50d93f43d..b48863f90b 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -435,8 +435,10 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	outerNode = outerPlan(node);
 	hashNode = (Hash *) innerPlan(node);
 
-	outerPlanState(hjstate) = ExecInitNode(outerNode, estate, eflags);
-	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
+	outerPlanState(hjstate) = ExecInitNode(outerNode, estate, eflags,
+										   (PlanState *) hjstate);
+	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags,
+										   (PlanState *) hjstate);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index aaec132218..bcacbfc13b 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -403,7 +403,7 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	 * then initialize outer plan
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * limit nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index b098034337..446a5e6fb3 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -376,7 +376,7 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(lrstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(lrstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * LockRows nodes do no projections, so initialize projection info for
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index aa5d2529f4..97d025977f 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -219,7 +219,7 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
 	outerPlan = outerPlan(node);
-	outerPlanState(matstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(matstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 7a20bf07a4..0327cf9a2a 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -112,7 +112,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		mergeplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		mergeplanstates[i] = ExecInitNode(initNode, estate, eflags, NULL);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 105e2dcedb..68c53ba1fe 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1473,9 +1473,10 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	 *
 	 * inner child must support MARK/RESTORE.
 	 */
-	outerPlanState(mergestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(mergestate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 	innerPlanState(mergestate) = ExecInitNode(innerPlan(node), estate,
-											  eflags | EXEC_FLAG_MARK);
+											  eflags | EXEC_FLAG_MARK,
+											  NULL);
 
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 95e158970c..ee6e4e7946 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1703,7 +1703,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
-		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
+		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags, NULL);
 
 		/* Also let FDWs init themselves for foreign-table result rels */
 		if (!resultRelInfo->ri_usesFdwDirectModify &&
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index cac7ba1b9b..697f5d48a2 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -302,12 +302,12 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 * inner child, because it will always be rescanned with fresh parameter
 	 * values.
 	 */
-	outerPlanState(nlstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(nlstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 	if (node->nestParams == NIL)
 		eflags |= EXEC_FLAG_REWIND;
 	else
 		eflags &= ~EXEC_FLAG_REWIND;
-	innerPlanState(nlstate) = ExecInitNode(innerPlan(node), estate, eflags);
+	innerPlanState(nlstate) = ExecInitNode(innerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeProjectSet.c b/src/backend/executor/nodeProjectSet.c
index eae0f1dad9..0c61685430 100644
--- a/src/backend/executor/nodeProjectSet.c
+++ b/src/backend/executor/nodeProjectSet.c
@@ -240,7 +240,7 @@ ExecInitProjectSet(ProjectSet *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(state) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(state) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * we don't use inner plan
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index fc1c00d68f..4b91f155c9 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -235,8 +235,8 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(rustate) = ExecInitNode(outerPlan(node), estate, eflags);
-	innerPlanState(rustate) = ExecInitNode(innerPlan(node), estate, eflags);
+	outerPlanState(rustate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
+	innerPlanState(rustate) = ExecInitNode(innerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * If hashing, precompute fmgr lookup data for inner loop, and create the
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index b5b50b21e9..bbc0c82c3f 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -221,7 +221,7 @@ ExecInitResult(Result *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(resstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(resstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * we don't use inner plan
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 85b3f67b33..f437ec1044 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -526,7 +526,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 	 */
 	if (node->strategy == SETOP_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
-	outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * setop nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 591a31aa6a..0028912509 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -199,7 +199,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	 */
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
-	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 230a96f9d2..b3cbe266dc 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -136,7 +136,7 @@ ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags)
 	/*
 	 * initialize subquery
 	 */
-	subquerystate->subplan = ExecInitNode(node->subplan, estate, eflags);
+	subquerystate->subplan = ExecInitNode(node->subplan, estate, eflags, NULL);
 
 	/*
 	 * Initialize scan tuple type (needed by ExecAssignScanProjectionInfo)
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 28cc1e90f8..244c49f2dc 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -143,7 +143,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * unique nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 2a123e8452..39971458d1 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1841,7 +1841,7 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	 * initialize child nodes
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(winstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(winstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * initialize source tuple type (which is also the tuple type that we'll
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 02dbe7b228..716362970f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -234,7 +234,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
 /*
  * prototypes from functions in execProcnode.c
  */
-extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
+extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags,
+	PlanState *parent);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f856f6036f..738f098b00 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1062,6 +1062,7 @@ typedef struct PlanState
 	 */
 	List	   *targetlist;		/* target list to be computed at this node */
 	List	   *qual;			/* implicitly-ANDed qual conditions */
+	struct PlanState *parent;   /* parent node, NULL if root */
 	struct PlanState *lefttree; /* input plan tree(s) */
 	struct PlanState *righttree;
 	List	   *initPlan;		/* Init SubPlanState nodes (un-correlated expr
-- 
2.11.0

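Patch 1 above threads a parent pointer through ExecInitNode so that each
PlanState knows its parent. A minimal sketch of that pattern follows; the
Node struct and init_node function here are invented for illustration and are
not the actual PostgreSQL types, but the recursion mirrors the extra
ExecInitNode argument the patch adds:

```c
#include <stddef.h>

/* Simplified sketch of threading a parent link through recursive node
 * initialization, as patch 1 does for PlanState/ExecInitNode. */
typedef struct Node
{
	struct Node *parent;		/* NULL if root */
	struct Node *left;
	struct Node *right;
} Node;

/* init_node links 'node' to 'parent' and recurses into both children,
 * passing itself as their parent. */
static void
init_node(Node *node, Node *parent)
{
	if (node == NULL)
		return;
	node->parent = parent;
	init_node(node->left, node);
	init_node(node->right, node);
}
```

With the parent link in place, a leaf can later push tuples upward without
the caller handing it a destination on every call.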
Attachment: 0002-Node-s-interface-functions-stubbed.patch (text/x-diff)
From 913a6d0352d21e9c91da683e48cf594d620e763c Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Fri, 10 Mar 2017 15:39:12 +0300
Subject: [PATCH 2/8] Node's interface functions stubbed

Namely, ExecProcNode, ExecInitNode, ExecEndNode, MultiExecProcNode, ExecReScan,
ExecutorRewind. It breaks the existing executor.
---
 src/backend/executor/execAmi.c      | 213 +-----------
 src/backend/executor/execMain.c     |  26 +-
 src/backend/executor/execProcnode.c | 633 +-----------------------------------
 3 files changed, 16 insertions(+), 856 deletions(-)

diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 5d59f95a91..a447cb92ba 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -73,218 +73,7 @@ static bool IndexSupportsBackwardScan(Oid indexid);
 void
 ExecReScan(PlanState *node)
 {
-	/* If collecting timing stats, update them */
-	if (node->instrument)
-		InstrEndLoop(node->instrument);
-
-	/*
-	 * If we have changed parameters, propagate that info.
-	 *
-	 * Note: ExecReScanSetParamPlan() can add bits to node->chgParam,
-	 * corresponding to the output param(s) that the InitPlan will update.
-	 * Since we make only one pass over the list, that means that an InitPlan
-	 * can depend on the output param(s) of a sibling InitPlan only if that
-	 * sibling appears earlier in the list.  This is workable for now given
-	 * the limited ways in which one InitPlan could depend on another, but
-	 * eventually we might need to work harder (or else make the planner
-	 * enlarge the extParam/allParam sets to include the params of depended-on
-	 * InitPlans).
-	 */
-	if (node->chgParam != NULL)
-	{
-		ListCell   *l;
-
-		foreach(l, node->initPlan)
-		{
-			SubPlanState *sstate = (SubPlanState *) lfirst(l);
-			PlanState  *splan = sstate->planstate;
-
-			if (splan->plan->extParam != NULL)	/* don't care about child
-												 * local Params */
-				UpdateChangedParamSet(splan, node->chgParam);
-			if (splan->chgParam != NULL)
-				ExecReScanSetParamPlan(sstate, node);
-		}
-		foreach(l, node->subPlan)
-		{
-			SubPlanState *sstate = (SubPlanState *) lfirst(l);
-			PlanState  *splan = sstate->planstate;
-
-			if (splan->plan->extParam != NULL)
-				UpdateChangedParamSet(splan, node->chgParam);
-		}
-		/* Well. Now set chgParam for left/right trees. */
-		if (node->lefttree != NULL)
-			UpdateChangedParamSet(node->lefttree, node->chgParam);
-		if (node->righttree != NULL)
-			UpdateChangedParamSet(node->righttree, node->chgParam);
-	}
-
-	/* Call expression callbacks */
-	if (node->ps_ExprContext)
-		ReScanExprContext(node->ps_ExprContext);
-
-	/* And do node-type-specific processing */
-	switch (nodeTag(node))
-	{
-		case T_ResultState:
-			ExecReScanResult((ResultState *) node);
-			break;
-
-		case T_ProjectSetState:
-			ExecReScanProjectSet((ProjectSetState *) node);
-			break;
-
-		case T_ModifyTableState:
-			ExecReScanModifyTable((ModifyTableState *) node);
-			break;
-
-		case T_AppendState:
-			ExecReScanAppend((AppendState *) node);
-			break;
-
-		case T_MergeAppendState:
-			ExecReScanMergeAppend((MergeAppendState *) node);
-			break;
-
-		case T_RecursiveUnionState:
-			ExecReScanRecursiveUnion((RecursiveUnionState *) node);
-			break;
-
-		case T_BitmapAndState:
-			ExecReScanBitmapAnd((BitmapAndState *) node);
-			break;
-
-		case T_BitmapOrState:
-			ExecReScanBitmapOr((BitmapOrState *) node);
-			break;
-
-		case T_SeqScanState:
-			ExecReScanSeqScan((SeqScanState *) node);
-			break;
-
-		case T_SampleScanState:
-			ExecReScanSampleScan((SampleScanState *) node);
-			break;
-
-		case T_GatherState:
-			ExecReScanGather((GatherState *) node);
-			break;
-
-		case T_IndexScanState:
-			ExecReScanIndexScan((IndexScanState *) node);
-			break;
-
-		case T_IndexOnlyScanState:
-			ExecReScanIndexOnlyScan((IndexOnlyScanState *) node);
-			break;
-
-		case T_BitmapIndexScanState:
-			ExecReScanBitmapIndexScan((BitmapIndexScanState *) node);
-			break;
-
-		case T_BitmapHeapScanState:
-			ExecReScanBitmapHeapScan((BitmapHeapScanState *) node);
-			break;
-
-		case T_TidScanState:
-			ExecReScanTidScan((TidScanState *) node);
-			break;
-
-		case T_SubqueryScanState:
-			ExecReScanSubqueryScan((SubqueryScanState *) node);
-			break;
-
-		case T_FunctionScanState:
-			ExecReScanFunctionScan((FunctionScanState *) node);
-			break;
-
-		case T_TableFuncScanState:
-			ExecReScanTableFuncScan((TableFuncScanState *) node);
-			break;
-
-		case T_ValuesScanState:
-			ExecReScanValuesScan((ValuesScanState *) node);
-			break;
-
-		case T_CteScanState:
-			ExecReScanCteScan((CteScanState *) node);
-			break;
-
-		case T_WorkTableScanState:
-			ExecReScanWorkTableScan((WorkTableScanState *) node);
-			break;
-
-		case T_ForeignScanState:
-			ExecReScanForeignScan((ForeignScanState *) node);
-			break;
-
-		case T_CustomScanState:
-			ExecReScanCustomScan((CustomScanState *) node);
-			break;
-
-		case T_NestLoopState:
-			ExecReScanNestLoop((NestLoopState *) node);
-			break;
-
-		case T_MergeJoinState:
-			ExecReScanMergeJoin((MergeJoinState *) node);
-			break;
-
-		case T_HashJoinState:
-			ExecReScanHashJoin((HashJoinState *) node);
-			break;
-
-		case T_MaterialState:
-			ExecReScanMaterial((MaterialState *) node);
-			break;
-
-		case T_SortState:
-			ExecReScanSort((SortState *) node);
-			break;
-
-		case T_GroupState:
-			ExecReScanGroup((GroupState *) node);
-			break;
-
-		case T_AggState:
-			ExecReScanAgg((AggState *) node);
-			break;
-
-		case T_WindowAggState:
-			ExecReScanWindowAgg((WindowAggState *) node);
-			break;
-
-		case T_UniqueState:
-			ExecReScanUnique((UniqueState *) node);
-			break;
-
-		case T_HashState:
-			ExecReScanHash((HashState *) node);
-			break;
-
-		case T_SetOpState:
-			ExecReScanSetOp((SetOpState *) node);
-			break;
-
-		case T_LockRowsState:
-			ExecReScanLockRows((LockRowsState *) node);
-			break;
-
-		case T_LimitState:
-			ExecReScanLimit((LimitState *) node);
-			break;
-
-		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			break;
-	}
-
-	if (node->chgParam != NULL)
-	{
-		bms_free(node->chgParam);
-		node->chgParam = NULL;
-	}
+	elog(ERROR, "ExecReScan not implemented yet");
 }
 
 /*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index efb3f30dd0..f629f0098f 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -509,30 +509,8 @@ standard_ExecutorEnd(QueryDesc *queryDesc)
 void
 ExecutorRewind(QueryDesc *queryDesc)
 {
-	EState	   *estate;
-	MemoryContext oldcontext;
-
-	/* sanity checks */
-	Assert(queryDesc != NULL);
-
-	estate = queryDesc->estate;
-
-	Assert(estate != NULL);
-
-	/* It's probably not sensible to rescan updating queries */
-	Assert(queryDesc->operation == CMD_SELECT);
-
-	/*
-	 * Switch into per-query memory context
-	 */
-	oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
-
-	/*
-	 * rescan plan
-	 */
-	ExecReScan(queryDesc->planstate);
-
-	MemoryContextSwitchTo(oldcontext);
+	elog(ERROR, "Rewinding not supported");
+	return;
 }
 
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c1c4cecd6c..649d1e58f6 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -131,7 +131,7 @@
  *		  'node' is the current node of the plan produced by the query planner
  *		  'estate' is the shared execution state for the plan tree
  *		  'eflags' is a bitwise OR of flag bits described in executor.h
- *        'parent' is parent of the node
+ *		  'parent' is parent of the node
  *
  *		Returns a PlanState node corresponding to the given Plan node.
  * ------------------------------------------------------------------------
@@ -140,8 +140,6 @@ PlanState *
 ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 {
 	PlanState  *result;
-	List	   *subps;
-	ListCell   *l;
 
 	/*
 	 * do nothing when we get to the end of a leaf on tree.
@@ -151,229 +149,13 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 
 	switch (nodeTag(node))
 	{
-			/*
-			 * control nodes
-			 */
-		case T_Result:
-			result = (PlanState *) ExecInitResult((Result *) node,
-												  estate, eflags);
-			break;
-
-		case T_ProjectSet:
-			result = (PlanState *) ExecInitProjectSet((ProjectSet *) node,
-													  estate, eflags);
-			break;
-
-		case T_ModifyTable:
-			result = (PlanState *) ExecInitModifyTable((ModifyTable *) node,
-													   estate, eflags);
-			break;
-
-		case T_Append:
-			result = (PlanState *) ExecInitAppend((Append *) node,
-												  estate, eflags);
-			break;
-
-		case T_MergeAppend:
-			result = (PlanState *) ExecInitMergeAppend((MergeAppend *) node,
-													   estate, eflags);
-			break;
-
-		case T_RecursiveUnion:
-			result = (PlanState *) ExecInitRecursiveUnion((RecursiveUnion *) node,
-														  estate, eflags);
-			break;
-
-		case T_BitmapAnd:
-			result = (PlanState *) ExecInitBitmapAnd((BitmapAnd *) node,
-													 estate, eflags);
-			break;
-
-		case T_BitmapOr:
-			result = (PlanState *) ExecInitBitmapOr((BitmapOr *) node,
-													estate, eflags);
-			break;
-
-			/*
-			 * scan nodes
-			 */
-		case T_SeqScan:
-			result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
-												   estate, eflags);
-			break;
-
-		case T_SampleScan:
-			result = (PlanState *) ExecInitSampleScan((SampleScan *) node,
-													  estate, eflags);
-			break;
-
-		case T_IndexScan:
-			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
-													 estate, eflags);
-			break;
-
-		case T_IndexOnlyScan:
-			result = (PlanState *) ExecInitIndexOnlyScan((IndexOnlyScan *) node,
-														 estate, eflags);
-			break;
-
-		case T_BitmapIndexScan:
-			result = (PlanState *) ExecInitBitmapIndexScan((BitmapIndexScan *) node,
-														   estate, eflags);
-			break;
-
-		case T_BitmapHeapScan:
-			result = (PlanState *) ExecInitBitmapHeapScan((BitmapHeapScan *) node,
-														  estate, eflags);
-			break;
-
-		case T_TidScan:
-			result = (PlanState *) ExecInitTidScan((TidScan *) node,
-												   estate, eflags);
-			break;
-
-		case T_SubqueryScan:
-			result = (PlanState *) ExecInitSubqueryScan((SubqueryScan *) node,
-														estate, eflags);
-			break;
-
-		case T_FunctionScan:
-			result = (PlanState *) ExecInitFunctionScan((FunctionScan *) node,
-														estate, eflags);
-			break;
-
-		case T_TableFuncScan:
-			result = (PlanState *) ExecInitTableFuncScan((TableFuncScan *) node,
-														 estate, eflags);
-			break;
-
-		case T_ValuesScan:
-			result = (PlanState *) ExecInitValuesScan((ValuesScan *) node,
-													  estate, eflags);
-			break;
-
-		case T_CteScan:
-			result = (PlanState *) ExecInitCteScan((CteScan *) node,
-												   estate, eflags);
-			break;
-
-		case T_WorkTableScan:
-			result = (PlanState *) ExecInitWorkTableScan((WorkTableScan *) node,
-														 estate, eflags);
-			break;
-
-		case T_ForeignScan:
-			result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
-													   estate, eflags);
-			break;
-
-		case T_CustomScan:
-			result = (PlanState *) ExecInitCustomScan((CustomScan *) node,
-													  estate, eflags);
-			break;
-
-			/*
-			 * join nodes
-			 */
-		case T_NestLoop:
-			result = (PlanState *) ExecInitNestLoop((NestLoop *) node,
-													estate, eflags);
-			break;
-
-		case T_MergeJoin:
-			result = (PlanState *) ExecInitMergeJoin((MergeJoin *) node,
-													 estate, eflags);
-			break;
-
-		case T_HashJoin:
-			result = (PlanState *) ExecInitHashJoin((HashJoin *) node,
-													estate, eflags);
-			break;
-
-			/*
-			 * materialization nodes
-			 */
-		case T_Material:
-			result = (PlanState *) ExecInitMaterial((Material *) node,
-													estate, eflags);
-			break;
-
-		case T_Sort:
-			result = (PlanState *) ExecInitSort((Sort *) node,
-												estate, eflags);
-			break;
-
-		case T_Group:
-			result = (PlanState *) ExecInitGroup((Group *) node,
-												 estate, eflags);
-			break;
-
-		case T_Agg:
-			result = (PlanState *) ExecInitAgg((Agg *) node,
-											   estate, eflags);
-			break;
-
-		case T_WindowAgg:
-			result = (PlanState *) ExecInitWindowAgg((WindowAgg *) node,
-													 estate, eflags);
-			break;
-
-		case T_Unique:
-			result = (PlanState *) ExecInitUnique((Unique *) node,
-												  estate, eflags);
-			break;
-
-		case T_Gather:
-			result = (PlanState *) ExecInitGather((Gather *) node,
-												  estate, eflags);
-			break;
-
-		case T_GatherMerge:
-			result = (PlanState *) ExecInitGatherMerge((GatherMerge *) node,
-													   estate, eflags);
-			break;
-
-		case T_Hash:
-			result = (PlanState *) ExecInitHash((Hash *) node,
-												estate, eflags);
-			break;
-
-		case T_SetOp:
-			result = (PlanState *) ExecInitSetOp((SetOp *) node,
-												 estate, eflags);
-			break;
-
-		case T_LockRows:
-			result = (PlanState *) ExecInitLockRows((LockRows *) node,
-													estate, eflags);
-			break;
-
-		case T_Limit:
-			result = (PlanState *) ExecInitLimit((Limit *) node,
-												 estate, eflags);
-			break;
-
 		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			elog(ERROR, "unrecognized/unsupported node type: %d",
+				 (int) nodeTag(node));
 			result = NULL;		/* keep compiler quiet */
 			break;
 	}
-
-	/*
-	 * Initialize any initPlans present in this node.  The planner put them in
-	 * a separate list for us.
-	 */
-	subps = NIL;
-	foreach(l, node->initPlan)
-	{
-		SubPlan    *subplan = (SubPlan *) lfirst(l);
-		SubPlanState *sstate;
-
-		Assert(IsA(subplan, SubPlan));
-		sstate = ExecInitSubPlan(subplan, result);
-		subps = lappend(subps, sstate);
-	}
-	result->initPlan = subps;
+	return NULL;
 
 	/* Set up instrumentation for this node if requested */
 	if (estate->es_instrument)
@@ -383,253 +165,27 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 }
 
 
-/* ----------------------------------------------------------------
- *		ExecProcNode
- *
- *		Execute the given node to return a(nother) tuple.
- * ----------------------------------------------------------------
+/*
+ * Unsupported, left to avoid deleting 19k lines of existing code
  */
 TupleTableSlot *
 ExecProcNode(PlanState *node)
 {
-	TupleTableSlot *result;
-
-	CHECK_FOR_INTERRUPTS();
-
-	if (node->chgParam != NULL) /* something changed */
-		ExecReScan(node);		/* let ReScan handle this */
-
-	if (node->instrument)
-		InstrStartNode(node->instrument);
-
-	switch (nodeTag(node))
-	{
-			/*
-			 * control nodes
-			 */
-		case T_ResultState:
-			result = ExecResult((ResultState *) node);
-			break;
-
-		case T_ProjectSetState:
-			result = ExecProjectSet((ProjectSetState *) node);
-			break;
-
-		case T_ModifyTableState:
-			result = ExecModifyTable((ModifyTableState *) node);
-			break;
-
-		case T_AppendState:
-			result = ExecAppend((AppendState *) node);
-			break;
-
-		case T_MergeAppendState:
-			result = ExecMergeAppend((MergeAppendState *) node);
-			break;
-
-		case T_RecursiveUnionState:
-			result = ExecRecursiveUnion((RecursiveUnionState *) node);
-			break;
-
-			/* BitmapAndState does not yield tuples */
-
-			/* BitmapOrState does not yield tuples */
-
-			/*
-			 * scan nodes
-			 */
-		case T_SeqScanState:
-			result = ExecSeqScan((SeqScanState *) node);
-			break;
-
-		case T_SampleScanState:
-			result = ExecSampleScan((SampleScanState *) node);
-			break;
-
-		case T_IndexScanState:
-			result = ExecIndexScan((IndexScanState *) node);
-			break;
-
-		case T_IndexOnlyScanState:
-			result = ExecIndexOnlyScan((IndexOnlyScanState *) node);
-			break;
-
-			/* BitmapIndexScanState does not yield tuples */
-
-		case T_BitmapHeapScanState:
-			result = ExecBitmapHeapScan((BitmapHeapScanState *) node);
-			break;
-
-		case T_TidScanState:
-			result = ExecTidScan((TidScanState *) node);
-			break;
-
-		case T_SubqueryScanState:
-			result = ExecSubqueryScan((SubqueryScanState *) node);
-			break;
-
-		case T_FunctionScanState:
-			result = ExecFunctionScan((FunctionScanState *) node);
-			break;
-
-		case T_TableFuncScanState:
-			result = ExecTableFuncScan((TableFuncScanState *) node);
-			break;
-
-		case T_ValuesScanState:
-			result = ExecValuesScan((ValuesScanState *) node);
-			break;
-
-		case T_CteScanState:
-			result = ExecCteScan((CteScanState *) node);
-			break;
-
-		case T_WorkTableScanState:
-			result = ExecWorkTableScan((WorkTableScanState *) node);
-			break;
-
-		case T_ForeignScanState:
-			result = ExecForeignScan((ForeignScanState *) node);
-			break;
-
-		case T_CustomScanState:
-			result = ExecCustomScan((CustomScanState *) node);
-			break;
-
-			/*
-			 * join nodes
-			 */
-		case T_NestLoopState:
-			result = ExecNestLoop((NestLoopState *) node);
-			break;
-
-		case T_MergeJoinState:
-			result = ExecMergeJoin((MergeJoinState *) node);
-			break;
-
-		case T_HashJoinState:
-			result = ExecHashJoin((HashJoinState *) node);
-			break;
-
-			/*
-			 * materialization nodes
-			 */
-		case T_MaterialState:
-			result = ExecMaterial((MaterialState *) node);
-			break;
-
-		case T_SortState:
-			result = ExecSort((SortState *) node);
-			break;
-
-		case T_GroupState:
-			result = ExecGroup((GroupState *) node);
-			break;
-
-		case T_AggState:
-			result = ExecAgg((AggState *) node);
-			break;
-
-		case T_WindowAggState:
-			result = ExecWindowAgg((WindowAggState *) node);
-			break;
-
-		case T_UniqueState:
-			result = ExecUnique((UniqueState *) node);
-			break;
-
-		case T_GatherState:
-			result = ExecGather((GatherState *) node);
-			break;
-
-		case T_GatherMergeState:
-			result = ExecGatherMerge((GatherMergeState *) node);
-			break;
-
-		case T_HashState:
-			result = ExecHash((HashState *) node);
-			break;
-
-		case T_SetOpState:
-			result = ExecSetOp((SetOpState *) node);
-			break;
-
-		case T_LockRowsState:
-			result = ExecLockRows((LockRowsState *) node);
-			break;
-
-		case T_LimitState:
-			result = ExecLimit((LimitState *) node);
-			break;
-
-		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			result = NULL;
-			break;
-	}
-
-	if (node->instrument)
-		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
-	return result;
+	elog(ERROR, "ExecProcNode is not supported");
+	return NULL;
 }
 
-
 /* ----------------------------------------------------------------
- *		MultiExecProcNode
- *
- *		Execute a node that doesn't return individual tuples
- *		(it might return a hashtable, bitmap, etc).  Caller should
- *		check it got back the expected kind of Node.
- *
- * This has essentially the same responsibilities as ExecProcNode,
- * but it does not do InstrStartNode/InstrStopNode (mainly because
- * it can't tell how many returned tuples to count).  Each per-node
- * function must provide its own instrumentation support.
+ * Unsupported too; we don't need it in the push model
  * ----------------------------------------------------------------
  */
 Node *
 MultiExecProcNode(PlanState *node)
 {
-	Node	   *result;
-
-	CHECK_FOR_INTERRUPTS();
-
-	if (node->chgParam != NULL) /* something changed */
-		ExecReScan(node);		/* let ReScan handle this */
-
-	switch (nodeTag(node))
-	{
-			/*
-			 * Only node types that actually support multiexec will be listed
-			 */
-
-		case T_HashState:
-			result = MultiExecHash((HashState *) node);
-			break;
-
-		case T_BitmapIndexScanState:
-			result = MultiExecBitmapIndexScan((BitmapIndexScanState *) node);
-			break;
-
-		case T_BitmapAndState:
-			result = MultiExecBitmapAnd((BitmapAndState *) node);
-			break;
-
-		case T_BitmapOrState:
-			result = MultiExecBitmapOr((BitmapOrState *) node);
-			break;
-
-		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			result = NULL;
-			break;
-	}
-
-	return result;
+	elog(ERROR, "MultiExecProcNode is not supported");
+	return NULL;
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecEndNode
  *
@@ -658,172 +214,9 @@ ExecEndNode(PlanState *node)
 
 	switch (nodeTag(node))
 	{
-			/*
-			 * control nodes
-			 */
-		case T_ResultState:
-			ExecEndResult((ResultState *) node);
-			break;
-
-		case T_ProjectSetState:
-			ExecEndProjectSet((ProjectSetState *) node);
-			break;
-
-		case T_ModifyTableState:
-			ExecEndModifyTable((ModifyTableState *) node);
-			break;
-
-		case T_AppendState:
-			ExecEndAppend((AppendState *) node);
-			break;
-
-		case T_MergeAppendState:
-			ExecEndMergeAppend((MergeAppendState *) node);
-			break;
-
-		case T_RecursiveUnionState:
-			ExecEndRecursiveUnion((RecursiveUnionState *) node);
-			break;
-
-		case T_BitmapAndState:
-			ExecEndBitmapAnd((BitmapAndState *) node);
-			break;
-
-		case T_BitmapOrState:
-			ExecEndBitmapOr((BitmapOrState *) node);
-			break;
-
-			/*
-			 * scan nodes
-			 */
-		case T_SeqScanState:
-			ExecEndSeqScan((SeqScanState *) node);
-			break;
-
-		case T_SampleScanState:
-			ExecEndSampleScan((SampleScanState *) node);
-			break;
-
-		case T_GatherState:
-			ExecEndGather((GatherState *) node);
-			break;
-
-		case T_GatherMergeState:
-			ExecEndGatherMerge((GatherMergeState *) node);
-			break;
-
-		case T_IndexScanState:
-			ExecEndIndexScan((IndexScanState *) node);
-			break;
-
-		case T_IndexOnlyScanState:
-			ExecEndIndexOnlyScan((IndexOnlyScanState *) node);
-			break;
-
-		case T_BitmapIndexScanState:
-			ExecEndBitmapIndexScan((BitmapIndexScanState *) node);
-			break;
-
-		case T_BitmapHeapScanState:
-			ExecEndBitmapHeapScan((BitmapHeapScanState *) node);
-			break;
-
-		case T_TidScanState:
-			ExecEndTidScan((TidScanState *) node);
-			break;
-
-		case T_SubqueryScanState:
-			ExecEndSubqueryScan((SubqueryScanState *) node);
-			break;
-
-		case T_FunctionScanState:
-			ExecEndFunctionScan((FunctionScanState *) node);
-			break;
-
-		case T_TableFuncScanState:
-			ExecEndTableFuncScan((TableFuncScanState *) node);
-			break;
-
-		case T_ValuesScanState:
-			ExecEndValuesScan((ValuesScanState *) node);
-			break;
-
-		case T_CteScanState:
-			ExecEndCteScan((CteScanState *) node);
-			break;
-
-		case T_WorkTableScanState:
-			ExecEndWorkTableScan((WorkTableScanState *) node);
-			break;
-
-		case T_ForeignScanState:
-			ExecEndForeignScan((ForeignScanState *) node);
-			break;
-
-		case T_CustomScanState:
-			ExecEndCustomScan((CustomScanState *) node);
-			break;
-
-			/*
-			 * join nodes
-			 */
-		case T_NestLoopState:
-			ExecEndNestLoop((NestLoopState *) node);
-			break;
-
-		case T_MergeJoinState:
-			ExecEndMergeJoin((MergeJoinState *) node);
-			break;
-
-		case T_HashJoinState:
-			ExecEndHashJoin((HashJoinState *) node);
-			break;
-
-			/*
-			 * materialization nodes
-			 */
-		case T_MaterialState:
-			ExecEndMaterial((MaterialState *) node);
-			break;
-
-		case T_SortState:
-			ExecEndSort((SortState *) node);
-			break;
-
-		case T_GroupState:
-			ExecEndGroup((GroupState *) node);
-			break;
-
-		case T_AggState:
-			ExecEndAgg((AggState *) node);
-			break;
-
-		case T_WindowAggState:
-			ExecEndWindowAgg((WindowAggState *) node);
-			break;
-
-		case T_UniqueState:
-			ExecEndUnique((UniqueState *) node);
-			break;
-
-		case T_HashState:
-			ExecEndHash((HashState *) node);
-			break;
-
-		case T_SetOpState:
-			ExecEndSetOp((SetOpState *) node);
-			break;
-
-		case T_LockRowsState:
-			ExecEndLockRows((LockRowsState *) node);
-			break;
-
-		case T_LimitState:
-			ExecEndLimit((LimitState *) node);
-			break;
-
 		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			elog(ERROR, "unrecognized/unsupported node type: %d",
+				 (int) nodeTag(node));
 			break;
 	}
 }
-- 
2.11.0

Attachment: 0003-Base-for-reversed-executor.patch (text/x-diff)
From 79e51be780dd733c6789e519176b26ea79282ea8 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Fri, 10 Mar 2017 17:26:26 +0300
Subject: [PATCH 3/8] Base for reversed executor.

Framework for implementing the reversed executor. Substitutes the ExecutePlan
call with RunNode, which invokes pushTuple on leaf nodes in the proper order.

See README for more details.
---
 src/backend/executor/README         |  45 +++++++
 src/backend/executor/execMain.c     | 255 +++++++++++++++++-------------------
 src/backend/executor/execProcnode.c |  53 +++++++-
 src/include/executor/executor.h     |   3 +
 src/include/nodes/execnodes.h       |  11 ++
 5 files changed, 230 insertions(+), 137 deletions(-)

diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c76c..86f6e99e86 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -3,6 +3,51 @@ src/backend/executor/README
 The Postgres Executor
 =====================
 
+This is an attempt to implement a proof of concept of an executor with a
+push-based architecture like in [1]. We will call it the 'reversed' executor.
+Right now we will not support both the reversed and the original executor at
+once, because that would involve a lot of code copy-pasting (or time to avoid
+it), while our current goal is just a working proof of concept to estimate the benefits.
+
+Since this is a huge change, we need to outline the general strategy: the
+things we will start with, and how we will deal with the old code, remembering
+that we will reuse a great deal of it.
+
+Key points:
+* ExecProcNode is now a stub. All per-node code (ExecSomeNode, etc.) is
+  unreachable. However, we leave it in place to avoid a 19k-line removal
+  commit and to produce more useful diffs later; much of it will be reused.
+* The base for the push model, common to all nodes, is in execMain.c and
+  execProcnode.c. We substitute ExecProcNode with pushTuple, whose interface
+  is described in the comment at its definition; this is the only change to
+  the node interface. We make the necessary changes to execMain.c, namely to
+  ExecutorRun, to drive the nodes in the proper order from the bottom up.
+* Then we are ready to implement the nodes one by one.
+
+At first,
+* parallel execution will not be supported.
+* subplans will not be supported.
+* ExecReScan will also not be supported for now.
+* only the CMD_SELECT operation will be supported.
+* only the forward direction will be supported.
+* set-returning functions will not be supported either.
+
+In general, we try to treat the old code as follows:
+* As said above, leave it in place even if it is dead for now.
+* If it is not dead but not yet updated for the reversed executor, remove it.
+  An example is the contents of ExecInitNode.
+* Sometimes we need to make minimal changes to an existing function, but those
+  changes make it incompatible with existing code that is not yet reworked.
+  In that case, to avoid deleting a lot of code, we will just copy-paste the
+  function until a more generic solution is provided. An example is
+  heapgettup_pagemode and its 'reversed' analogue added for seqscan.
+
+
+[1] Efficiently Compiling Efficient Query Plans for Modern Hardware,
+    http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
+
+The original README text follows below.
+
 The executor processes a tree of "plan nodes".  The plan tree is essentially
 a demand-pull pipeline of tuple processing operations.  Each node, when
 called, will produce the next tuple in its output sequence, or NULL if no
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index f629f0098f..bb25a4137c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -63,6 +63,7 @@
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
+#include "executor/executor.h"
 
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
@@ -79,13 +80,7 @@ static void InitPlan(QueryDesc *queryDesc, int eflags);
 static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
 static void ExecPostprocessPlan(EState *estate);
 static void ExecEndPlan(PlanState *planstate, EState *estate);
-static void ExecutePlan(EState *estate, PlanState *planstate,
-			bool use_parallel_mode,
-			CmdType operation,
-			bool sendTuples,
-			uint64 numberTuples,
-			ScanDirection direction,
-			DestReceiver *dest);
+static void RunNode(PlanState *planstate);
 static bool ExecCheckRTEPerms(RangeTblEntry *rte);
 static bool ExecCheckRTEPermsModified(Oid relOid, Oid userid,
 						  Bitmapset *modifiedCols,
@@ -341,18 +336,24 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	if (sendTuples)
 		(*dest->rStartup) (dest, operation, queryDesc->tupDesc);
 
+	/* set up state needed for sending tuples to the dest */
+	estate->es_current_tuple_count = 0;
+	estate->es_sendTuples = sendTuples;
+	estate->es_numberTuplesRequested = count;
+	estate->es_operation = operation;
+	estate->es_dest = dest;
+
+	/*
+	 * Set the direction.
+	 */
+	estate->es_direction = direction;
+
 	/*
 	 * run plan
 	 */
 	if (!ScanDirectionIsNoMovement(direction))
-		ExecutePlan(estate,
-					queryDesc->planstate,
-					queryDesc->plannedstmt->parallelModeNeeded,
-					operation,
-					sendTuples,
-					count,
-					direction,
-					dest);
+		/* Run each leaf in the right order */
+		RunNode(queryDesc->planstate);
 
 	/*
 	 * shutdown tuple receiver, if we started it
@@ -1533,126 +1534,6 @@ ExecEndPlan(PlanState *planstate, EState *estate)
 	}
 }
 
-/* ----------------------------------------------------------------
- *		ExecutePlan
- *
- *		Processes the query plan until we have retrieved 'numberTuples' tuples,
- *		moving in the specified direction.
- *
- *		Runs to completion if numberTuples is 0
- *
- * Note: the ctid attribute is a 'junk' attribute that is removed before the
- * user can see it
- * ----------------------------------------------------------------
- */
-static void
-ExecutePlan(EState *estate,
-			PlanState *planstate,
-			bool use_parallel_mode,
-			CmdType operation,
-			bool sendTuples,
-			uint64 numberTuples,
-			ScanDirection direction,
-			DestReceiver *dest)
-{
-	TupleTableSlot *slot;
-	uint64		current_tuple_count;
-
-	/*
-	 * initialize local variables
-	 */
-	current_tuple_count = 0;
-
-	/*
-	 * Set the direction.
-	 */
-	estate->es_direction = direction;
-
-	/*
-	 * If a tuple count was supplied, we must force the plan to run without
-	 * parallelism, because we might exit early.  Also disable parallelism
-	 * when writing into a relation, because no database changes are allowed
-	 * in parallel mode.
-	 */
-	if (numberTuples || dest->mydest == DestIntoRel)
-		use_parallel_mode = false;
-
-	if (use_parallel_mode)
-		EnterParallelMode();
-
-	/*
-	 * Loop until we've processed the proper number of tuples from the plan.
-	 */
-	for (;;)
-	{
-		/* Reset the per-output-tuple exprcontext */
-		ResetPerTupleExprContext(estate);
-
-		/*
-		 * Execute the plan and obtain a tuple
-		 */
-		slot = ExecProcNode(planstate);
-
-		/*
-		 * if the tuple is null, then we assume there is nothing more to
-		 * process so we just end the loop...
-		 */
-		if (TupIsNull(slot))
-		{
-			/* Allow nodes to release or shut down resources. */
-			(void) ExecShutdownNode(planstate);
-			break;
-		}
-
-		/*
-		 * If we have a junk filter, then project a new tuple with the junk
-		 * removed.
-		 *
-		 * Store this new "clean" tuple in the junkfilter's resultSlot.
-		 * (Formerly, we stored it back over the "dirty" tuple, which is WRONG
-		 * because that tuple slot has the wrong descriptor.)
-		 */
-		if (estate->es_junkFilter != NULL)
-			slot = ExecFilterJunk(estate->es_junkFilter, slot);
-
-		/*
-		 * If we are supposed to send the tuple somewhere, do so. (In
-		 * practice, this is probably always the case at this point.)
-		 */
-		if (sendTuples)
-		{
-			/*
-			 * If we are not able to send the tuple, we assume the destination
-			 * has closed and no more tuples can be sent. If that's the case,
-			 * end the loop.
-			 */
-			if (!((*dest->receiveSlot) (slot, dest)))
-				break;
-		}
-
-		/*
-		 * Count tuples processed, if this is a SELECT.  (For other operation
-		 * types, the ModifyTable plan node must count the appropriate
-		 * events.)
-		 */
-		if (operation == CMD_SELECT)
-			(estate->es_processed)++;
-
-		/*
-		 * check our tuple count.. if we've processed the proper number then
-		 * quit, else loop again and process more tuples.  Zero numberTuples
-		 * means no limit.
-		 */
-		current_tuple_count++;
-		if (numberTuples && numberTuples == current_tuple_count)
-			break;
-	}
-
-	if (use_parallel_mode)
-		ExitParallelMode();
-}
-
-
 /*
  * ExecRelCheck --- check that tuple meets constraints for result relation
  *
@@ -3291,3 +3172,107 @@ ExecBuildSlotPartitionKeyDescription(Relation rel,
 
 	return buf.data;
 }
+
+/*
+ * This function pushes a ready tuple to its destination. It should be
+ * called by the top-level PlanState.
+ * For now, the state needed for this lives in estate, specifically
+ * current_tuple_count, sendTuples, numberTuplesRequested (old numberTuples),
+ * cmdType, dest.
+ *
+ * slot is the tuple to push
+ * planstate is the top-level node
+ * Returns true if we are ready to accept more tuples, false otherwise.
+ */
+bool
+SendReadyTuple(TupleTableSlot *slot, PlanState *planstate)
+{
+	EState *estate;
+	bool sendTuples;
+	CmdType operation;
+	DestReceiver *dest;
+
+	estate = planstate->state;
+	sendTuples = estate->es_sendTuples;
+	operation = estate->es_operation;
+	dest = estate->es_dest;
+
+	if (TupIsNull(slot))
+	{
+		/* Allow nodes to release or shut down resources. */
+		(void) ExecShutdownNode(planstate);
+		return false;
+	}
+
+	/*
+	 * If we have a junk filter, then project a new tuple with the junk
+	 * removed.
+	 *
+	 * Store this new "clean" tuple in the junkfilter's resultSlot.
+	 * (Formerly, we stored it back over the "dirty" tuple, which is WRONG
+	 * because that tuple slot has the wrong descriptor.)
+	 */
+	if (estate->es_junkFilter != NULL)
+		slot = ExecFilterJunk(estate->es_junkFilter, slot);
+
+	/*
+	 * If we are supposed to send the tuple somewhere, do so. (In
+	 * practice, this is probably always the case at this point.)
+	 */
+	if (sendTuples)
+	{
+		/*
+		 * If we are not able to send the tuple, we assume the destination
+		 * has closed and no more tuples can be sent.
+		 */
+		if (!((*dest->receiveSlot) (slot, dest)))
+			return false;
+	}
+
+	/*
+	 * Count tuples processed, if this is a SELECT.  (For other operation
+	 * types, the ModifyTable plan node must count the appropriate
+	 * events.)
+	 */
+	if (operation == CMD_SELECT)
+		(estate->es_processed)++;
+
+	/*
+	 * check our tuple count.. if we've processed the proper number then
+	 * quit, else process more tuples.  Zero numberTuplesRequested
+	 * means no limit.
+	 */
+	estate->es_current_tuple_count++;
+	if (estate->es_numberTuplesRequested &&
+		estate->es_numberTuplesRequested == estate->es_current_tuple_count)
+		return false;
+
+	ResetPerTupleExprContext(estate);
+	return true;
+}
+
+/*
+ * When pushing, we have to call pushTuple on each leaf of the tree in the
+ * correct order: inner sides first, then outer. This function does exactly that.
+ */
+void
+RunNode(PlanState *planstate)
+{
+	Assert(planstate != NULL);
+
+	if (innerPlanState(planstate) != NULL)
+	{
+		RunNode(innerPlanState(planstate));
+		/* We assume that if an inner node exists, the outer one exists too */
+		RunNode(outerPlanState(planstate));
+		return;
+	}
+	if (outerPlanState(planstate) != NULL)
+	{
+		RunNode(outerPlanState(planstate));
+		return;
+	}
+
+	/* node has no children, it is a leaf */
+	pushTuple(NULL, planstate, NULL);
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 649d1e58f6..a95cfe5430 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -155,7 +155,6 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 			result = NULL;		/* keep compiler quiet */
 			break;
 	}
-	return NULL;
 
 	/* Set up instrumentation for this node if requested */
 	if (estate->es_instrument)
@@ -164,7 +163,6 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 	return result;
 }
 
-
 /*
  * Unsupported, left to avoid deleting 19k lines of existing code
  */
@@ -175,6 +173,57 @@ ExecProcNode(PlanState *node)
 	return NULL;
 }
 
+/*
+ * Instead of the old ExecProcNode, here we have the function pushTuple,
+ * which pushes one tuple.
+ * 'slot' is the tuple to push
+ * 'node' is the receiver of the tuple
+ * 'pusher' is the sender of the tuple; its parent is 'node'. We need it to
+ * distinguish inner and outer pushes.
+ * Returns true if node is still accepting tuples, false if not.
+ * ReScans are not supported yet.
+ * In general, pushing a tuple (even NULL) into a node that has already
+ * returned 'false' is not allowed and the behaviour is undefined;
+ * however, we will try to catch such situations with asserts.
+ * If a lower node has sent a NULL tuple to an upper node, for now we do not
+ * bother to return a meaningful bool result and just return false by convention.
+ */
+bool
+pushTuple(TupleTableSlot *slot, PlanState *node, PlanState *pusher)
+{
+	bool push_from_outer;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/* If the receiver is NULL, the pusher is the top-level node, so we
+	 * need to send the tuple to the dest
+	 */
+	if (!node)
+	{
+		return SendReadyTuple(slot, pusher);
+	}
+
+	/*
+	 * If pusher is NULL, then node is a bottom node, another special case:
+	 * bottom nodes obviously need neither a tuple nor a pusher
+	 */
+	if (!pusher)
+	{
+		switch (nodeTag(node))
+		{
+			default:
+				elog(ERROR, "bottom node type not supported: %d",
+					 (int) nodeTag(node));
+				return false;
+		}
+	}
+
+	/* does push come from the outer side? */
+	push_from_outer = outerPlanState(node) == pusher;
+
+	elog(ERROR, "node type not supported: %d", (int) nodeTag(node));
+}
+
 /* ----------------------------------------------------------------
  * Unsupported too; we don't need it in push model
  * ----------------------------------------------------------------
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 716362970f..eb4e27ce21 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -179,6 +179,7 @@ extern void ExecutorRun(QueryDesc *queryDesc,
 			ScanDirection direction, uint64 count);
 extern void standard_ExecutorRun(QueryDesc *queryDesc,
 					 ScanDirection direction, uint64 count);
+extern bool SendReadyTuple(TupleTableSlot *slot, PlanState *planstate);
 extern void ExecutorFinish(QueryDesc *queryDesc);
 extern void standard_ExecutorFinish(QueryDesc *queryDesc);
 extern void ExecutorEnd(QueryDesc *queryDesc);
@@ -240,6 +241,8 @@ extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
+extern bool pushTuple(TupleTableSlot *slot, PlanState *node,
+					  PlanState *pusher);
 
 /*
  * prototypes from functions in execQual.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 738f098b00..da7fd9c7ac 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -28,6 +28,7 @@
 #include "utils/tuplesort.h"
 #include "nodes/tidbitmap.h"
 #include "storage/condition_variable.h"
+#include "tcop/dest.h" /* for DestReceiver type in EState */
 
 
 /* ----------------
@@ -416,6 +417,16 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * State needed to push tuples to dest in push model, technically it is
+	 * local variables from old ExecutePlan
+	 */
+	uint64		es_current_tuple_count;
+	bool		es_sendTuples;
+	uint64		es_numberTuplesRequested;
+	CmdType		es_operation;
+	DestReceiver *es_dest;
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
-- 
2.11.0

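The push protocol introduced by this patch (pushTuple returns a bool telling the producer whether to keep pushing) can be sketched outside Postgres with a small standalone C toy. All names below (Consumer, limit_push, scan_and_push) are hypothetical and only illustrate the control-flow inversion; they are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* A consumer node exposes a "push" callback returning true while it
 * accepts more tuples, false to tell the producer to stop -- the
 * analogue of pushTuple's bool result. */
typedef struct Consumer
{
	bool		(*push) (struct Consumer *self, int tuple);
	int			seen;			/* tuples accepted so far */
	int			limit;			/* stop after this many (0 = no limit) */
} Consumer;

/* A LIMIT-like top node: counts tuples, declines once the limit is hit. */
static bool
limit_push(Consumer *self, int tuple)
{
	self->seen++;
	return self->limit == 0 || self->seen < self->limit;
}

/* A leaf "scan": drives the loop itself, pushing every tuple upward and
 * stopping as soon as the consumer returns false.  In the pull
 * (ExecProcNode) model this loop would live in the top-level caller. */
static int
scan_and_push(const int *rows, int nrows, Consumer *dest)
{
	int			pushed = 0;

	for (int i = 0; i < nrows; i++)
	{
		pushed++;
		if (!dest->push(dest, rows[i]))
			break;				/* parent is done; release resources here */
	}
	return pushed;
}
```

Note how early termination (LIMIT, a closed DestReceiver) is expressed purely through the boolean return value, which is exactly the role of pushTuple's and SendReadyTuple's result in the patch.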
0004-Reversed-SeqScan-implementation.patchtext/x-diffDownload
From d99cead6b8eb28fe238d918aa0333389413aab77 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Fri, 10 Mar 2017 21:52:01 +0300
Subject: [PATCH 4/8] Reversed SeqScan implementation.

The main job is done by the heappushtups function, which iterates over tuples
and pushes each one. It is mostly a copy of heapgettup_pagemode, which is left
for compatibility.

The handling of each tuple (checking quals, etc) is implemented as inline
functions in nodeSeqscan.h.

Since heapam.h must now know about PlanState, some forward declarations were
added, which is kind of ugly.

EvalPlanQual is not supported.
---
 src/backend/access/heap/heapam.c    | 256 ++++++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c |  17 +++
 src/backend/executor/nodeSeqscan.c  |  76 ++++-------
 src/include/access/heapam.h         |   9 ++
 src/include/executor/nodeSeqscan.h  | 149 ++++++++++++++++++++-
 5 files changed, 451 insertions(+), 56 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 85261379b1..0e6eafd44f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -73,6 +73,8 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "executor/executor.h"
+#include "executor/nodeSeqscan.h"
 
 
 /* GUC variable */
@@ -9236,3 +9238,257 @@ heap_mask(char *pagedata, BlockNumber blkno)
 		}
 	}
 }
+
+/* ----------------
+ * Fetch tuples, check quals and push them. A modified heapgettup_pagemode
+ * with a lot of copy-pasting.
+ * This function in fact doesn't care about the pusher type and function,
+ * although SeqScanState and the inlined SeqPushHeapTuple are hardcoded for now.
+ * ----------------
+ */
+void
+heappushtups(HeapScanDesc scan,
+			 ScanDirection dir,
+			 int nkeys,
+			 ScanKey key,
+			 PlanState *node,
+			 SeqScanState *pusher)
+{
+	HeapTuple	tuple = &(scan->rs_ctup);
+	bool		backward = ScanDirectionIsBackward(dir);
+	BlockNumber page;
+	bool		finished;
+	Page		dp;
+	int			lines;
+	int			lineindex;
+	OffsetNumber lineoff;
+	int			linesleft;
+	ItemId		lpp;
+
+	/* no movement is not supported for now */
+	Assert(!ScanDirectionIsNoMovement(dir));
+
+	/*
+	 * calculate next starting lineindex, given scan direction
+	 */
+	if (ScanDirectionIsForward(dir))
+	{
+		if (!scan->rs_inited)
+		{
+			/*
+			 * push a NULL tuple immediately if the relation is empty
+			 */
+			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
+			{
+				Assert(!BufferIsValid(scan->rs_cbuf));
+				tuple->t_data = NULL;
+				SeqPushHeapTuple(&(scan->rs_ctup), node, pusher);
+				return;
+			}
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+
+				/* Other processes might have already finished the scan. */
+				if (page == InvalidBlockNumber)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					SeqPushHeapTuple(&(scan->rs_ctup), node, pusher);
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
+			heapgetpage(scan, page);
+			lineindex = 0;
+			scan->rs_inited = true;
+		}
+		else
+		{
+			/* continue from previously returned page/tuple */
+			page = scan->rs_cblock;		/* current page */
+			lineindex = scan->rs_cindex + 1;
+		}
+
+		dp = BufferGetPage(scan->rs_cbuf);
+		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
+		lines = scan->rs_ntuples;
+		/* page and lineindex now reference the next visible tid */
+
+		linesleft = lines - lineindex;
+	}
+	else /* backward */
+	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
+		if (!scan->rs_inited)
+		{
+			/*
+			 * push a NULL tuple immediately if the relation is empty
+			 */
+			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
+			{
+				Assert(!BufferIsValid(scan->rs_cbuf));
+				tuple->t_data = NULL;
+				SeqPushHeapTuple(&(scan->rs_ctup), node, pusher);
+				return;
+			}
+
+			/*
+			 * Disable reporting to syncscan logic in a backwards scan; it's
+			 * not very likely anyone else is doing the same thing at the same
+			 * time, and much more likely that we'll just bollix things for
+			 * forward scanners.
+			 */
+			scan->rs_syncscan = false;
+			/* start from last page of the scan */
+			if (scan->rs_startblock > 0)
+				page = scan->rs_startblock - 1;
+			else
+				page = scan->rs_nblocks - 1;
+			heapgetpage(scan, page);
+		}
+		else
+		{
+			/* continue from previously returned page/tuple */
+			page = scan->rs_cblock;		/* current page */
+		}
+
+		dp = BufferGetPage(scan->rs_cbuf);
+		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
+		lines = scan->rs_ntuples;
+
+		if (!scan->rs_inited)
+		{
+			lineindex = lines - 1;
+			scan->rs_inited = true;
+		}
+		else
+		{
+			lineindex = scan->rs_cindex - 1;
+		}
+		/* page and lineindex now reference the previous visible tid */
+
+		linesleft = lineindex + 1;
+	}
+
+	/*
+	 * advance the scan until we find a qualifying tuple or run out of stuff
+	 * to scan
+	 */
+	for (;;)
+	{
+		while (linesleft > 0)
+		{
+			bool tuple_qualifies = false;
+
+			lineoff = scan->rs_vistuples[lineindex];
+			lpp = PageGetItemId(dp, lineoff);
+			Assert(ItemIdIsNormal(lpp));
+
+			tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
+			tuple->t_len = ItemIdGetLength(lpp);
+			ItemPointerSet(&(tuple->t_self), page, lineoff);
+
+			/*
+			 * if current tuple qualifies, push it.
+			 */
+			if (key != NULL)
+			{
+				HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
+							nkeys, key, tuple_qualifies);
+			}
+			else
+			{
+				tuple_qualifies = true;
+			}
+
+			if (tuple_qualifies)
+			{
+				/* Push tuple */
+				scan->rs_cindex = lineindex;
+				pgstat_count_heap_getnext(scan->rs_rd);
+				if (!SeqPushHeapTuple(&(scan->rs_ctup), node, pusher))
+					return;
+			}
+
+			/*
+			 * and carry on to the next one anyway
+			 */
+			--linesleft;
+			if (backward)
+				--lineindex;
+			else
+				++lineindex;
+		}
+
+		/*
+		 * if we get here, it means we've exhausted the items on this page and
+		 * it's time to move to the next.
+		 */
+		if (backward)
+		{
+			finished = (page == scan->rs_startblock) ||
+				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (page == 0)
+				page = scan->rs_nblocks;
+			page--;
+		}
+		else if (scan->rs_parallel != NULL)
+		{
+			page = heap_parallelscan_nextpage(scan);
+			finished = (page == InvalidBlockNumber);
+		}
+		else
+		{
+			page++;
+			if (page >= scan->rs_nblocks)
+				page = 0;
+			finished = (page == scan->rs_startblock) ||
+				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+
+			/*
+			 * Report our new scan position for synchronization purposes. We
+			 * don't do that when moving backwards, however. That would just
+			 * mess up any other forward-moving scanners.
+			 *
+			 * Note: we do this before checking for end of scan so that the
+			 * final state of the position hint is back at the start of the
+			 * rel.  That's not strictly necessary, but otherwise when you run
+			 * the same query multiple times the starting position would shift
+			 * a little bit backwards on every invocation, which is confusing.
+			 * We don't guarantee any specific ordering in general, though.
+			 */
+			if (scan->rs_syncscan)
+				ss_report_location(scan->rs_rd, page);
+		}
+
+		/*
+		 * push NULL if we've exhausted all the pages
+		 */
+		if (finished)
+		{
+			if (BufferIsValid(scan->rs_cbuf))
+				ReleaseBuffer(scan->rs_cbuf);
+			scan->rs_cbuf = InvalidBuffer;
+			scan->rs_cblock = InvalidBlockNumber;
+			tuple->t_data = NULL;
+			scan->rs_inited = false;
+			SeqPushHeapTuple(&(scan->rs_ctup), node, pusher);
+			return;
+		}
+
+		heapgetpage(scan, page);
+
+		dp = BufferGetPage(scan->rs_cbuf);
+		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
+		lines = scan->rs_ntuples;
+		linesleft = lines;
+		if (backward)
+			lineindex = lines - 1;
+		else
+			lineindex = 0;
+	}
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a95cfe5430..b0468667bb 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -149,6 +149,13 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 
 	switch (nodeTag(node))
 	{
+		/*
+		 * scan nodes
+		 */
+		case T_SeqScan:
+			result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
+												   estate, eflags, parent);
+			break;
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
@@ -211,6 +218,9 @@ pushTuple(TupleTableSlot *slot, PlanState *node, PlanState *pusher)
 	{
 		switch (nodeTag(node))
 		{
+			case T_SeqScanState:
+				return pushTupleToSeqScan((SeqScanState *) node);
+
 			default:
 				elog(ERROR, "bottom node type not supported: %d",
 					 (int) nodeTag(node));
@@ -263,6 +273,13 @@ ExecEndNode(PlanState *node)
 
 	switch (nodeTag(node))
 	{
+		/*
+		 * scan nodes
+		 */
+		case T_SeqScanState:
+			ExecEndSeqScan((SeqScanState *) node);
+			break;
+
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index e61895de0a..8c0aa44f0f 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -15,7 +15,7 @@
 /*
  * INTERFACE ROUTINES
  *		ExecSeqScan				sequentially scans a relation.
- *		ExecSeqNext				retrieve next tuple in sequential order.
+ *		pushTupleToSeqScan		pushes all tuples to parent node
  *		ExecInitSeqScan			creates and initializes a seqscan node.
  *		ExecEndSeqScan			releases any storage allocated.
  *		ExecReScanSeqScan		rescans the relation
@@ -30,29 +30,24 @@
 #include "executor/execdebug.h"
 #include "executor/nodeSeqscan.h"
 #include "utils/rel.h"
+#include "access/heapam.h"
 
 static void InitScanRelation(SeqScanState *node, EState *estate, int eflags);
-static TupleTableSlot *SeqNext(SeqScanState *node);
 
 /* ----------------------------------------------------------------
  *						Scan Support
  * ----------------------------------------------------------------
  */
 
-/* ----------------------------------------------------------------
- *		SeqNext
- *
- *		This is a workhorse for ExecSeqScan
- * ----------------------------------------------------------------
+/* Push scanned tuples to the parent. Stop when all tuples are pushed or
+ * the parent tells us to stop pushing.
  */
-static TupleTableSlot *
-SeqNext(SeqScanState *node)
+bool
+pushTupleToSeqScan(SeqScanState *node)
 {
-	HeapTuple	tuple;
-	HeapScanDesc scandesc;
 	EState	   *estate;
+	HeapScanDesc scandesc;
 	ScanDirection direction;
-	TupleTableSlot *slot;
 
 	/*
 	 * get information from the estate and scan state
@@ -60,8 +55,11 @@ SeqNext(SeqScanState *node)
 	scandesc = node->ss.ss_currentScanDesc;
 	estate = node->ss.ps.state;
 	direction = estate->es_direction;
-	slot = node->ss.ss_ScanTupleSlot;
 
+	/* ExecScanFetch not implemented */
+	Assert(estate->es_epqTuple == NULL);
+
+	/* create scandesc, part of old SeqNext before heap_getnext */
 	if (scandesc == NULL)
 	{
 		/*
@@ -73,30 +71,17 @@ SeqNext(SeqScanState *node)
 								  0, NULL);
 		node->ss.ss_currentScanDesc = scandesc;
 	}
+	Assert(scandesc);
 
-	/*
-	 * get the next tuple from the table
-	 */
-	tuple = heap_getnext(scandesc, direction);
+	/* non-page-at-a-time mode is not supported for now */
+	Assert(scandesc->rs_pageatatime);
+	heappushtups(scandesc, direction,
+				 scandesc->rs_nkeys,
+				 scandesc->rs_key,
+				 node->ss.ps.parent,
+				 node);
 
-	/*
-	 * save the tuple and the buffer returned to us by the access methods in
-	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
-	 * tuples returned by heap_getnext() are pointers onto disk pages and were
-	 * not created with palloc() and so should not be pfree()'d.  Note also
-	 * that ExecStoreTuple will increment the refcount of the buffer; the
-	 * refcount will not be dropped until the tuple table slot is cleared.
-	 */
-	if (tuple)
-		ExecStoreTuple(tuple,	/* tuple to store */
-					   slot,	/* slot to store in */
-					   scandesc->rs_cbuf,		/* buffer associated with this
-												 * tuple */
-					   false);	/* don't pfree this pointer */
-	else
-		ExecClearTuple(slot);
-
-	return slot;
+	return false;
 }
 
 /*
@@ -113,23 +98,6 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
 }
 
 /* ----------------------------------------------------------------
- *		ExecSeqScan(node)
- *
- *		Scans the relation sequentially and returns the next qualifying
- *		tuple.
- *		We call the ExecScan() routine and pass it the appropriate
- *		access method functions.
- * ----------------------------------------------------------------
- */
-TupleTableSlot *
-ExecSeqScan(SeqScanState *node)
-{
-	return ExecScan((ScanState *) node,
-					(ExecScanAccessMtd) SeqNext,
-					(ExecScanRecheckMtd) SeqRecheck);
-}
-
-/* ----------------------------------------------------------------
  *		InitScanRelation
  *
  *		Set up to access the scan relation.
@@ -154,13 +122,12 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecInitSeqScan
  * ----------------------------------------------------------------
  */
 SeqScanState *
-ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
+ExecInitSeqScan(SeqScan *node, EState *estate, int eflags, PlanState *parent)
 {
 	SeqScanState *scanstate;
 
@@ -177,6 +144,7 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 	scanstate = makeNode(SeqScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.parent = parent;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7e85510d2f..a0b826e88d 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -126,6 +126,15 @@ extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 					 bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
+/* forward decls because heapam.h now needs to know about PlanState */
+typedef struct PlanState PlanState;
+typedef struct SeqScanState SeqScanState;
+extern void heappushtups(HeapScanDesc scan,
+						 ScanDirection dir,
+						 int nkeys,
+						 ScanKey key,
+						 PlanState *node,
+						 SeqScanState *pusher);
 
 extern Size heap_parallelscan_estimate(Snapshot snapshot);
 extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
diff --git a/src/include/executor/nodeSeqscan.h b/src/include/executor/nodeSeqscan.h
index 92b305e138..21b8e42b6e 100644
--- a/src/include/executor/nodeSeqscan.h
+++ b/src/include/executor/nodeSeqscan.h
@@ -15,10 +15,15 @@
 #define NODESEQSCAN_H
 
 #include "access/parallel.h"
+#include "access/relscan.h"
 #include "nodes/execnodes.h"
+#include "executor/executor.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
 
-extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSeqScan(SeqScanState *node);
+extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags,
+									 PlanState *parent);
+extern bool pushTupleToSeqScan(SeqScanState *node);
 extern void ExecEndSeqScan(SeqScanState *node);
 extern void ExecReScanSeqScan(SeqScanState *node);
 
@@ -27,4 +32,144 @@ extern void ExecSeqScanEstimate(SeqScanState *node, ParallelContext *pcxt);
 extern void ExecSeqScanInitializeDSM(SeqScanState *node, ParallelContext *pcxt);
 extern void ExecSeqScanInitializeWorker(SeqScanState *node, shm_toc *toc);
 
+/* inline functions decls and implementations */
+#pragma GCC diagnostic warning "-Winline"
+static inline void SeqPushNull(PlanState *node, SeqScanState *pusher);
+static inline TupleTableSlot *SeqStoreTuple(SeqScanState *node,
+											HeapTuple tuple);
+static inline bool SeqPushHeapTuple(HeapTuple tuple, PlanState *node,
+									SeqScanState *pusher);
+
+/* push NULL to the parent, signaling that we are done */
+static inline void
+SeqPushNull(PlanState *node, SeqScanState *pusher)
+{
+	ProjectionInfo *projInfo;
+	TupleTableSlot *slot;
+
+	projInfo = pusher->ss.ps.ps_ProjInfo;
+	slot = pusher->ss.ss_ScanTupleSlot;
+
+	ExecClearTuple(slot);
+
+	if (projInfo)
+		pushTuple(ExecClearTuple(projInfo->pi_slot), node,
+				  (PlanState *) pusher);
+	else
+		pushTuple(slot, node,
+				  (PlanState *) pusher);
+}
+
+/*
+ * HeapTuple --> node->ss_ScanTupleSlot, part of original SeqNext after
+ * heap_getnext
+ */
+static inline TupleTableSlot *
+SeqStoreTuple(SeqScanState *node, HeapTuple tuple)
+{
+	HeapScanDesc scandesc;
+	TupleTableSlot *slot;
+
+	/*
+	 * get information from the scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	slot = node->ss.ss_ScanTupleSlot;
+
+	Assert(tuple);
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot.  Note: we pass 'false' because tuples returned by
+	 * heap_getnext() are pointers onto disk pages and were not created with
+	 * palloc() and so should not be pfree()'d.  Note also that ExecStoreTuple
+	 * will increment the refcount of the buffer; the refcount will not be
+	 * dropped until the tuple table slot is cleared.
+	 */
+	ExecStoreTuple(tuple,	/* tuple to store */
+				   slot,	/* slot to store in */
+				   scandesc->rs_cbuf,		/* buffer associated with this
+											 * tuple */
+				   false);	/* don't pfree this pointer */
+	return slot;
+}
+
+/* Push a ready HeapTuple from a SeqScanState
+ *
+ * Check the qual for the tuple and push it. The tuple must not be NULL.
+ * Returns true if the parent accepts more tuples, false otherwise.
+ */
+static inline bool SeqPushHeapTuple(HeapTuple tuple, PlanState *node,
+							 SeqScanState *pusher)
+{
+	ExprContext *econtext;
+	List	   *qual;
+	ProjectionInfo *projInfo;
+	TupleTableSlot *slot;
+
+	if (tuple->t_data == NULL)
+	{
+		SeqPushNull(node, pusher);
+		return false;
+	}
+
+	/*
+	 * Fetch data from node
+	 */
+	qual = pusher->ss.ps.qual;
+	projInfo = pusher->ss.ps.ps_ProjInfo;
+	econtext = pusher->ss.ps.ps_ExprContext;
+
+	CHECK_FOR_INTERRUPTS();
+
+	slot = SeqStoreTuple(pusher, tuple);
+
+	/*
+	 * If we have neither a qual to check nor a projection to do, just skip
+	 * all the overhead and push the raw scan tuple.
+	 */
+	if (!qual && !projInfo)
+	{
+		return pushTuple(slot, node, (PlanState *) pusher);
+	}
+
+	ResetExprContext(econtext);
+	/*
+	 * place the current tuple into the expr context
+	 */
+	econtext->ecxt_scantuple = slot;
+
+	/*
+	 * check that the current tuple satisfies the qual-clause
+	 *
+	 * check for non-nil qual here to avoid a function call to ExecQual()
+	 * when the qual is nil ... saves only a few cycles, but they add up
+	 * ...
+	 */
+	if (!qual || ExecQual(qual, econtext, false))
+	{
+		/*
+		 * Found a satisfactory scan tuple.
+		 */
+		if (projInfo)
+		{
+			/*
+			 * Form a projection tuple and store it in the result tuple
+			 * slot.
+			 */
+			slot = ExecProject(projInfo);
+		}
+		/*
+		 * Push either the projected tuple or the raw scan tuple to the
+		 * parent.
+		 */
+		return pushTuple(slot, node, (PlanState *) pusher);
+	}
+	else
+		InstrCountFiltered1(pusher, 1);
+
+	return true;
+}
+
+
 #endif   /* NODESEQSCAN_H */
-- 
2.11.0

0005-Reversed-HashJoin-implementation.patch (text/x-diff)
From c9936b1a2460c3b3cf3a42cf1ef51b4d018c6c07 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Sat, 11 Mar 2017 00:36:31 +0300
Subject: [PATCH 5/8] Reversed HashJoin implementation.

The main point is that tuples are pushed immediately after a match is found,
i.e. we scan the whole bucket in one loop and pushTuple each match.
---
 src/backend/executor/execProcnode.c |  39 +++
 src/backend/executor/nodeHash.c     | 242 +++++++++++++++++-
 src/backend/executor/nodeHashjoin.c | 479 +++++++++++++-----------------------
 src/include/executor/nodeHash.h     |   9 +-
 src/include/executor/nodeHashjoin.h | 100 +++++++-
 src/include/nodes/execnodes.h       |   2 +
 6 files changed, 547 insertions(+), 324 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index b0468667bb..88e14d144a 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -156,6 +156,22 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 			result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
 												   estate, eflags, parent);
 			break;
+
+		/*
+		 * join nodes
+		 */
+		case T_HashJoin:
+			result = (PlanState *) ExecInitHashJoin((HashJoin *) node,
+													estate, eflags, parent);
+			break;
+
+		/*
+		 * materialization nodes
+		 */
+		case T_Hash:
+			result = (PlanState *) ExecInitHash((Hash *) node,
+												estate, eflags, parent);
+			break;
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
@@ -231,6 +247,15 @@ pushTuple(TupleTableSlot *slot, PlanState *node, PlanState *pusher)
 	/* does push come from the outer side? */
 	push_from_outer = outerPlanState(node) == pusher;
 
+	if (nodeTag(node) == T_HashState)
+		return pushTupleToHash(slot, (HashState *) node);
+
+	else if (nodeTag(node) == T_HashJoinState && push_from_outer)
+		return pushTupleToHashJoinFromOuter(slot, (HashJoinState *) node);
+
+	else if (nodeTag(node) == T_HashJoinState && !push_from_outer)
+		return pushTupleToHashJoinFromInner(slot, (HashJoinState *) node);
+
 	elog(ERROR, "node type not supported: %d", (int) nodeTag(node));
 }
 
@@ -280,6 +305,20 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		/*
+		 * join nodes
+		 */
+		case T_HashJoinState:
+			ExecEndHashJoin((HashJoinState *) node);
+			break;
+
+		/*
+		 * materialization nodes
+		 */
+		case T_HashState:
+			ExecEndHash((HashState *) node);
+			break;
+
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 43e65ca04e..06fe45f29b 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -50,17 +50,95 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static void *dense_alloc(HashJoinTable hashtable, Size size);
 
-/* ----------------------------------------------------------------
- *		ExecHash
- *
- *		stub for pro forma compliance
- * ----------------------------------------------------------------
+
+/* Put incoming tuples into the hash table; when NULL is received, finalize
+ * building the hash table and notify HashJoin about that.
  */
-TupleTableSlot *
-ExecHash(HashState *node)
+bool
+pushTupleToHash(TupleTableSlot *slot, HashState *node)
 {
-	elog(ERROR, "Hash node does not support ExecProcNode call convention");
-	return NULL;
+	List	   *hashkeys;
+	HashJoinTable hashtable;
+	ExprContext *econtext;
+	uint32		hashvalue;
+	HashJoinState *hj_node;
+
+	hj_node = (HashJoinState *) node->ps.parent;
+
+	/* Create the hash table.  In vanilla Postgres this code is in HashJoin */
+	if (node->first_time_through)
+	{
+		Assert(node->hashtable == NULL);
+
+		node->hashtable = ExecHashTableCreate((Hash *) node->ps.plan,
+											  hj_node->hj_HashOperators,
+											  HJ_FILL_INNER(hj_node));
+
+		/* must provide our own instrumentation support */
+		if (node->ps.instrument)
+			InstrStartNode(node->ps.instrument);
+
+		node->first_time_through = false;
+	}
+
+	/*
+	 * get state info from node
+	 */
+	hashtable = node->hashtable;
+
+	/*
+	 * set expression context
+	 */
+	hashkeys = node->hashkeys;
+	econtext = node->ps.ps_ExprContext;
+
+	/* NULL tuple received; let HashJoin know that the hashtable is built
+	   and exit */
+	if (TupIsNull(slot))
+	{
+		/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+		if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+			ExecHashIncreaseNumBuckets(hashtable);
+
+		/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+		hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
+		if (hashtable->spaceUsed > hashtable->spacePeak)
+			hashtable->spacePeak = hashtable->spaceUsed;
+
+		/* must provide our own instrumentation support */
+		if (node->ps.instrument)
+			InstrStopNode(node->ps.instrument, hashtable->totalTuples);
+
+		pushTuple(NULL, (PlanState *) node->ps.parent, (PlanState *) node);
+		return false;
+	}
+
+	/* We have to compute the hash value */
+	econtext->ecxt_innertuple = slot;
+	if (ExecHashGetHashValue(hashtable, econtext, hashkeys,
+							 false, hashtable->keepNulls,
+							 &hashvalue))
+	{
+		int			bucketNumber;
+
+		bucketNumber = ExecHashGetSkewBucket(hashtable, hashvalue);
+		if (bucketNumber != INVALID_SKEW_BUCKET_NO)
+		{
+			/* It's a skew tuple, so put it into that hash table */
+			ExecHashSkewTableInsert(hashtable, slot, hashvalue,
+									bucketNumber);
+			hashtable->skewTuples += 1;
+		}
+		else
+		{
+			/* Not subject to skew optimization, so insert normally */
+			ExecHashTableInsert(hashtable, slot, hashvalue);
+		}
+		hashtable->totalTuples += 1;
+	}
+
+	/* ready to accept another tuple */
+	return true;
 }
 
 /* ----------------------------------------------------------------
@@ -159,7 +237,7 @@ MultiExecHash(HashState *node)
  * ----------------------------------------------------------------
  */
 HashState *
-ExecInitHash(Hash *node, EState *estate, int eflags)
+ExecInitHash(Hash *node, EState *estate, int eflags, PlanState *parent)
 {
 	HashState  *hashstate;
 
@@ -172,8 +250,10 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	hashstate = makeNode(HashState);
 	hashstate->ps.plan = (Plan *) node;
 	hashstate->ps.state = estate;
+	hashstate->ps.parent = parent;
 	hashstate->hashtable = NULL;
 	hashstate->hashkeys = NIL;	/* will be set by parent HashJoin */
+	hashstate->first_time_through = true;
 
 	/*
 	 * Miscellaneous initialization
@@ -201,7 +281,7 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	 * initialize child nodes
 	 */
 	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags,
-											 (PlanState*) hashstate);
+											 (PlanState *) hashstate);
 
 	/*
 	 * initialize tuple type. no need to initialize projection info because
@@ -1115,6 +1195,68 @@ ExecScanHashBucket(HashJoinState *hjstate,
 }
 
 /*
+ * ExecScanHashBucketAndPush
+ *		scan a hash bucket for matches to the current outer tuple and push
+ *		them
+ *
+ * The current outer tuple must be stored in econtext->ecxt_outertuple.
+ *
+ * Returns true if the parent still accepts tuples, false otherwise.
+ */
+bool
+ExecScanHashBucketAndPush(HashJoinState *hjstate,
+						  ExprContext *econtext)
+{
+	List	   *hjclauses = hjstate->hashclauses;
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	HashJoinTuple hashTuple;
+	uint32		hashvalue = hjstate->hj_CurHashValue;
+	bool parent_accepts_tuples = true;
+
+	/*
+	 * For now, we don't support pausing execution; we either push all matching
+	 * tuples from the bucket at once or don't touch it at all.
+	 */
+	Assert(hjstate->hj_CurTuple == NULL);
+
+	/*
+	 * If the tuple hashed to a skew bucket then scan the skew bucket
+	 * otherwise scan the standard hashtable bucket.
+	 */
+	if (hjstate->hj_CurSkewBucketNo != INVALID_SKEW_BUCKET_NO)
+		hashTuple = hashtable->skewBucket[hjstate->hj_CurSkewBucketNo]->tuples;
+	else
+		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
+
+	while (hashTuple != NULL)
+	{
+		if (hashTuple->hashvalue == hashvalue)
+		{
+			TupleTableSlot *inntuple;
+
+			/* insert hashtable's tuple into exec slot so ExecQual sees it */
+			inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
+											 hjstate->hj_HashTupleSlot,
+											 false);	/* do not pfree */
+			econtext->ecxt_innertuple = inntuple;
+
+			/* reset temp memory each time to avoid leaks from qual expr */
+			ResetExprContext(econtext);
+
+			if (ExecQual(hjclauses, econtext, false))
+			{
+				hjstate->hj_CurTuple = hashTuple;
+				parent_accepts_tuples = CheckJoinQualAndPush(hjstate);
+			}
+		}
+
+		hashTuple = hashTuple->next;
+	}
+
+	return parent_accepts_tuples;
+}
+
+/*
  * ExecPrepHashTableForUnmatched
  *		set up for a series of ExecScanHashTableForUnmatched calls
  */
@@ -1206,6 +1348,84 @@ ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext)
 }
 
 /*
+ * ExecScanHashTableForUnmatchedAndPush
+ *		scan the hash table for unmatched inner tuples and push them
+ *
+ * Like ExecScanHashTableForUnmatched, but pushes all tuples immediately.
+ * Returns true if the parent still accepts tuples, false otherwise.
+ */
+bool
+ExecScanHashTableForUnmatchedAndPush(HashJoinState *hjstate,
+									 ExprContext *econtext)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	HashJoinTuple hashTuple = NULL;
+	bool parent_accepts_tuples = true;
+
+	/*
+	 * For now, we don't support pausing execution, so we never enter here twice.
+	 */
+	Assert(hjstate->hj_CurTuple == NULL);
+
+	for (;;)
+	{
+		/*
+		 * hashTuple is the tuple last visited in the current bucket, or
+		 * NULL if it's time to start scanning a new (regular or skew)
+		 * bucket.
+		 */
+		if (hashTuple != NULL)
+			hashTuple = hashTuple->next;
+		else if (hjstate->hj_CurBucketNo < hashtable->nbuckets)
+		{
+			hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
+			hjstate->hj_CurBucketNo++;
+		}
+		else if (hjstate->hj_CurSkewBucketNo < hashtable->nSkewBuckets)
+		{
+			int			j = hashtable->skewBucketNums[hjstate->hj_CurSkewBucketNo];
+
+			hashTuple = hashtable->skewBucket[j]->tuples;
+			hjstate->hj_CurSkewBucketNo++;
+		}
+		else
+			break;				/* finished all buckets */
+
+		while (hashTuple != NULL)
+		{
+			if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(hashTuple)))
+			{
+				TupleTableSlot *inntuple;
+
+				/* insert hashtable's tuple into exec slot */
+				inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
+												 hjstate->hj_HashTupleSlot,
+												 false);		/* do not pfree */
+				econtext->ecxt_innertuple = inntuple;
+
+				/*
+				 * Reset temp memory each time; although this function doesn't
+				 * do any qual eval, the caller will, so let's keep it
+				 * parallel to ExecScanHashBucket.
+				 */
+				ResetExprContext(econtext);
+
+				/*
+				 * Record the current tuple; since we don't support pausing
+				 * execution, this is probably unnecessary.
+				 */
+				hjstate->hj_CurTuple = hashTuple;
+				parent_accepts_tuples = PushUnmatched(hjstate);
+			}
+
+			hashTuple = hashTuple->next;
+		}
+	}
+
+	return parent_accepts_tuples;
+}
+
+/*
  * ExecHashTableReset
  *
  *		reset hash table header for new batch
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index b48863f90b..6c637548e1 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -27,172 +27,149 @@
 /*
  * States of the ExecHashJoin state machine
  */
-#define HJ_BUILD_HASHTABLE		1
-#define HJ_NEED_NEW_OUTER		2
-#define HJ_SCAN_BUCKET			3
-#define HJ_FILL_OUTER_TUPLE		4
-#define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
-
-/* Returns true if doing null-fill on outer relation */
-#define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
-/* Returns true if doing null-fill on inner relation */
-#define HJ_FILL_INNER(hjstate)	((hjstate)->hj_NullOuterTupleSlot != NULL)
-
-static TupleTableSlot *ExecHashJoinOuterGetTuple(PlanState *outerNode,
-						  HashJoinState *hjstate,
-						  uint32 *hashvalue);
-static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
-						  BufFile *file,
-						  uint32 *hashvalue,
-						  TupleTableSlot *tupleSlot);
+#define HJ_BUILD_HASHTABLE				1
+#define HJ_NEED_NEW_OUTER				2
+#define HJ_SCAN_BUCKET					3
+#define HJ_FILL_OUTER_TUPLE				4
+#define HJ_FILL_INNER_TUPLES			5
+#define HJ_NEED_NEW_BATCH				6
+#define HJ_WAITING_FOR_NEW_OUTER		7
+#define HJ_HANDLE_NEW_OUTER				8
+#define HJ_TAKE_OUTER_FROM_TEMP_FILE	9
+
+static TupleTableSlot *ExecHashJoinGetSavedTuple(BufFile *file,
+												 uint32 *hashvalue,
+												 TupleTableSlot *tupleSlot);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
+static TupleTableSlot *TakeOuterFromTempFile(HashJoinState *hjstate,
+											 uint32 *hashvalue);
 
 
-/* ----------------------------------------------------------------
- *		ExecHashJoin
- *
- *		This function implements the Hybrid Hashjoin algorithm.
- *
- *		Note: the relation we build hash table on is the "inner"
- *			  the other one is "outer".
- * ----------------------------------------------------------------
+
+/*
+ * This function is called from the Hash node with a NULL slot, signaling
+ * that the hashtable is built.
+ * The "extract-one-outer-tuple-to-check-if-it-is-null-before-building-hashtable"
+ * optimization is not implemented for now; the hashtable is always built
+ * first.
+ */
+bool
+pushTupleToHashJoinFromInner(TupleTableSlot *slot, HashJoinState *node)
+{
+	HashJoinTable hashtable;
+	HashState *hashNode;
+
+	hashNode = (HashState *) innerPlanState(node);
+
+	/* we should get there only once */
+	Assert(node->hj_JoinState == HJ_BUILD_HASHTABLE);
+	/* we will fish out the tuples from Hash node ourselves */
+	Assert(TupIsNull(slot));
+
+	/* we always build the hashtable first */
+	node->hj_FirstOuterTupleSlot = NULL;
+
+	hashtable = hashNode->hashtable;
+	node->hj_HashTable = hashtable;
+
+	/*
+	 * need to remember whether nbatch has increased since we
+	 * began scanning the outer relation
+	 */
+	hashtable->nbatch_outstart = hashtable->nbatch;
+
+	/*
+	 * Reset OuterNotEmpty for scan.
+	 */
+	node->hj_OuterNotEmpty = false;
+
+	node->hj_JoinState = HJ_WAITING_FOR_NEW_OUTER;
+
+	/* Don't send us anything on the inner side */
+	return false;
+}
+
+/*
+ * Push from the outer side. Find matches and send them upward to HashJoin's
+ * parent. Return true if the parent is ready to accept another tuple, false
+ * otherwise. When this function is called, the hashtable must already
+ * be filled.
  */
-TupleTableSlot *				/* return: a tuple or NULL */
-ExecHashJoin(HashJoinState *node)
+bool
+pushTupleToHashJoinFromOuter(TupleTableSlot *slot, HashJoinState *node)
 {
-	PlanState  *outerNode;
-	HashState  *hashNode;
-	List	   *joinqual;
-	List	   *otherqual;
 	ExprContext *econtext;
 	HashJoinTable hashtable;
-	TupleTableSlot *outerTupleSlot;
 	uint32		hashvalue;
 	int			batchno;
 
 	/*
 	 * get information from HashJoin node
 	 */
-	joinqual = node->js.joinqual;
-	otherqual = node->js.ps.qual;
-	hashNode = (HashState *) innerPlanState(node);
-	outerNode = outerPlanState(node);
-	hashtable = node->hj_HashTable;
 	econtext = node->js.ps.ps_ExprContext;
+	hashtable = node->hj_HashTable;
 
-	/*
-	 * Reset per-tuple memory context to free any expression evaluation
-	 * storage allocated in the previous tuple cycle.
-	 */
-	ResetExprContext(econtext);
+	/* We must always be in this state when the tuple is pushed */
+	Assert(node->hj_JoinState == HJ_WAITING_FOR_NEW_OUTER);
 
-	/*
-	 * run the hash join state machine
-	 */
-	for (;;)
+	if (!TupIsNull(slot))
 	{
-		switch (node->hj_JoinState)
+		/*
+		 * We have to compute the tuple's hash value.
+		 */
+		econtext->ecxt_outertuple = slot;
+		if (!ExecHashGetHashValue(hashtable, econtext,
+								  node->hj_OuterHashKeys,
+								  true,		/* outer tuple */
+								  HJ_FILL_OUTER(node),
+								  &hashvalue))
 		{
-			case HJ_BUILD_HASHTABLE:
+			/*
+			 * That tuple couldn't match because of a NULL, so discard it and
+			 * wait for the next one.
+			 */
+			return true;
+		}
+	}
 
-				/*
-				 * First time through: build hash table for inner relation.
-				 */
-				Assert(hashtable == NULL);
+	/* ready to handle this slot */
+	node->hj_JoinState = HJ_HANDLE_NEW_OUTER;
 
-				/*
-				 * If the outer relation is completely empty, and it's not
-				 * right/full join, we can quit without building the hash
-				 * table.  However, for an inner join it is only a win to
-				 * check this when the outer relation's startup cost is less
-				 * than the projected cost of building the hash table.
-				 * Otherwise it's best to build the hash table first and see
-				 * if the inner relation is empty.  (When it's a left join, we
-				 * should always make this check, since we aren't going to be
-				 * able to skip the join on the strength of an empty inner
-				 * relation anyway.)
-				 *
-				 * If we are rescanning the join, we make use of information
-				 * gained on the previous scan: don't bother to try the
-				 * prefetch if the previous scan found the outer relation
-				 * nonempty. This is not 100% reliable since with new
-				 * parameters the outer relation might yield different
-				 * results, but it's a good heuristic.
-				 *
-				 * The only way to make the check is to try to fetch a tuple
-				 * from the outer plan node.  If we succeed, we have to stash
-				 * it away for later consumption by ExecHashJoinOuterGetTuple.
-				 */
-				if (HJ_FILL_INNER(node))
-				{
-					/* no chance to not build the hash table */
-					node->hj_FirstOuterTupleSlot = NULL;
-				}
-				else if (HJ_FILL_OUTER(node) ||
-						 (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
-						  !node->hj_OuterNotEmpty))
+	/* Push tuples matching to the received outer tuple while we can */
+	for (;;)
+	{
+		switch (node->hj_JoinState)
+		{
+			case HJ_NEED_NEW_OUTER:
+				if (hashtable->curbatch == 0)
 				{
-					node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode);
-					if (TupIsNull(node->hj_FirstOuterTupleSlot))
-					{
-						node->hj_OuterNotEmpty = false;
-						return NULL;
-					}
-					else
-						node->hj_OuterNotEmpty = true;
+					/*
+					 * On the first batch, we always fetch tuples from the
+					 * nodes below, not from temp files, so set the state to
+					 * waiting for a new outer tuple and tell the node below
+					 * that we are ready to accept it.
+					 */
+					node->hj_JoinState = HJ_WAITING_FOR_NEW_OUTER;
+					return true;
 				}
-				else
-					node->hj_FirstOuterTupleSlot = NULL;
-
-				/*
-				 * create the hash table
-				 */
-				hashtable = ExecHashTableCreate((Hash *) hashNode->ps.plan,
-												node->hj_HashOperators,
-												HJ_FILL_INNER(node));
-				node->hj_HashTable = hashtable;
-
-				/*
-				 * execute the Hash node, to build the hash table
-				 */
-				hashNode->hashtable = hashtable;
-				(void) MultiExecProcNode((PlanState *) hashNode);
-
-				/*
-				 * If the inner relation is completely empty, and we're not
-				 * doing a left outer join, we can quit without scanning the
-				 * outer relation.
-				 */
-				if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
-					return NULL;
-
 				/*
-				 * need to remember whether nbatch has increased since we
-				 * began scanning the outer relation
+				 * On subsequent batches, we always take tuples from temp
+				 * files.
 				 */
-				hashtable->nbatch_outstart = hashtable->nbatch;
-
-				/*
-				 * Reset OuterNotEmpty for scan.  (It's OK if we fetched a
-				 * tuple above, because ExecHashJoinOuterGetTuple will
-				 * immediately set it again.)
-				 */
-				node->hj_OuterNotEmpty = false;
-
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
+				slot = TakeOuterFromTempFile(node, &hashvalue);
+				/* ready to handle this slot */
+				node->hj_JoinState = HJ_HANDLE_NEW_OUTER;
 
 				/* FALL THRU */
 
-			case HJ_NEED_NEW_OUTER:
-
-				/*
-				 * We don't have an outer tuple, try to get the next one
+			case HJ_HANDLE_NEW_OUTER:
+				/* Handle a new outer tuple, either from the temp files or
+				 * from the nodes below.  It can be NULL, which means the
+				 * end of the batch.  hashvalue must be set at this point,
+				 * and the tuple must be in the 'slot' variable.
 				 */
-				outerTupleSlot = ExecHashJoinOuterGetTuple(outerNode,
-														   node,
-														   &hashvalue);
-				if (TupIsNull(outerTupleSlot))
+
+				if (TupIsNull(slot))
 				{
 					/* end of batch, or maybe whole join */
 					if (HJ_FILL_INNER(node))
@@ -206,7 +183,7 @@ ExecHashJoin(HashJoinState *node)
 					continue;
 				}
 
-				econtext->ecxt_outertuple = outerTupleSlot;
+				econtext->ecxt_outertuple = slot;
 				node->hj_MatchedOuter = false;
 
 				/*
@@ -232,14 +209,18 @@ ExecHashJoin(HashJoinState *node)
 					 * Save it in the corresponding outer-batch file.
 					 */
 					Assert(batchno > hashtable->curbatch);
-					ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
+					ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(slot),
 										  hashvalue,
-										&hashtable->outerBatchFile[batchno]);
-					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
-					continue;
+										  &hashtable->outerBatchFile[batchno]);
+					/* In fact, this can only happen while we are processing
+					 * the first batch, so we just wait for the next outer
+					 * tuple.
+					 */
+					node->hj_JoinState = HJ_WAITING_FOR_NEW_OUTER;
+					return true;
 				}
 
-				/* OK, let's scan the bucket for matches */
+				/* OK, let's scan this bucket for matches with this tuple */
 				node->hj_JoinState = HJ_SCAN_BUCKET;
 
 				/* FALL THRU */
@@ -254,55 +235,14 @@ ExecHashJoin(HashJoinState *node)
 				CHECK_FOR_INTERRUPTS();
 
 				/*
-				 * Scan the selected hash bucket for matches to current outer
+				 * Push all matching tuples from selected hash bucket
 				 */
-				if (!ExecScanHashBucket(node, econtext))
-				{
-					/* out of matches; check for possible outer-join fill */
-					node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-					continue;
-				}
+				if (!ExecScanHashBucketAndPush(node, econtext))
+					return false;
 
-				/*
-				 * We've got a match, but still need to test non-hashed quals.
-				 * ExecScanHashBucket already set up all the state needed to
-				 * call ExecQual.
-				 *
-				 * If we pass the qual, then save state for next call and have
-				 * ExecProject form the projection, store it in the tuple
-				 * table, and return the slot.
-				 *
-				 * Only the joinquals determine tuple match status, but all
-				 * quals must pass to actually return the tuple.
-				 */
-				if (joinqual == NIL || ExecQual(joinqual, econtext, false))
-				{
-					node->hj_MatchedOuter = true;
-					HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+				node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
 
-					/* In an antijoin, we never return a matched tuple */
-					if (node->js.jointype == JOIN_ANTI)
-					{
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
-						continue;
-					}
-
-					/*
-					 * In a semijoin, we'll consider returning the first
-					 * match, but after that we're done with this outer tuple.
-					 */
-					if (node->js.jointype == JOIN_SEMI)
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
-
-					if (otherqual == NIL ||
-						ExecQual(otherqual, econtext, false))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
-				}
-				else
-					InstrCountFiltered1(node, 1);
-				break;
+				/* FALL THRU */
 
 			case HJ_FILL_OUTER_TUPLE:
 
@@ -313,20 +253,16 @@ ExecHashJoin(HashJoinState *node)
 				 */
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 
-				if (!node->hj_MatchedOuter &&
-					HJ_FILL_OUTER(node))
+				if (!node->hj_MatchedOuter && HJ_FILL_OUTER(node))
 				{
 					/*
 					 * Generate a fake join tuple with nulls for the inner
-					 * tuple, and return it if it passes the non-join quals.
+					 * tuple, and push it if it passes the non-join quals.
 					 */
 					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
 
-					if (otherqual == NIL ||
-						ExecQual(otherqual, econtext, false))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
+					if (!CheckOtherQualAndPush(node))
+						return false;
 				}
 				break;
 
@@ -337,24 +273,10 @@ ExecHashJoin(HashJoinState *node)
 				 * so any unmatched inner tuples in the hashtable have to be
 				 * emitted before we continue to the next batch.
 				 */
-				if (!ExecScanHashTableForUnmatched(node, econtext))
-				{
-					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					continue;
-				}
-
-				/*
-				 * Generate a fake join tuple with nulls for the outer tuple,
-				 * and return it if it passes the non-join quals.
-				 */
-				econtext->ecxt_outertuple = node->hj_NullOuterTupleSlot;
+				if (!ExecScanHashTableForUnmatchedAndPush(node, econtext))
+					return false;
 
-				if (otherqual == NIL ||
-					ExecQual(otherqual, econtext, false))
-					return ExecProject(node->js.ps.ps_ProjInfo);
-				else
-					InstrCountFiltered2(node, 1);
+				node->hj_JoinState = HJ_NEED_NEW_BATCH;
 				break;
 
 			case HJ_NEED_NEW_BATCH:
@@ -363,17 +285,43 @@ ExecHashJoin(HashJoinState *node)
 				 * Try to advance to next batch.  Done if there are no more.
 				 */
 				if (!ExecHashJoinNewBatch(node))
-					return NULL;	/* end of join */
+				{
+					/* let parent know that we are done */
+					pushTuple(NULL, node->js.ps.parent, (PlanState *) node);
+					return false;	/* end of join */
+				}
 				node->hj_JoinState = HJ_NEED_NEW_OUTER;
 				break;
 
-			default:
-				elog(ERROR, "unrecognized hashjoin state: %d",
-					 (int) node->hj_JoinState);
 		}
 	}
 }
 
+/*
+ * Get the next outer tuple from the saved temp files. If we are here, we
+ * are not processing the first batch. On success, the tuple's hash value,
+ * re-read from the temp file, is stored at *hashvalue.
+ * Returns NULL at the end of the batch, a tuple otherwise.
+ */
+static TupleTableSlot *
+TakeOuterFromTempFile(HashJoinState *hjstate, uint32 *hashvalue)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *file = hashtable->outerBatchFile[curbatch];
+
+	/*
+	 * In outer-join cases, we could get here even though the batch file
+	 * is empty.
+	 */
+	if (file == NULL)
+		return NULL;
+
+	return ExecHashJoinGetSavedTuple(file,
+									 hashvalue,
+									 hjstate->hj_OuterTupleSlot);
+}
+
 /* ----------------------------------------------------------------
  *		ExecInitHashJoin
  *
@@ -381,7 +329,7 @@ ExecHashJoin(HashJoinState *node)
  * ----------------------------------------------------------------
  */
 HashJoinState *
-ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
+ExecInitHashJoin(HashJoin *node, EState *estate, int eflags, PlanState *parent)
 {
 	HashJoinState *hjstate;
 	Plan	   *outerNode;
@@ -400,6 +348,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate = makeNode(HashJoinState);
 	hjstate->js.ps.plan = (Plan *) node;
 	hjstate->js.ps.state = estate;
+	hjstate->js.ps.parent = parent;
 
 	/*
 	 * Miscellaneous initialization
@@ -579,89 +528,6 @@ ExecEndHashJoin(HashJoinState *node)
 }
 
 /*
- * ExecHashJoinOuterGetTuple
- *
- *		get the next outer tuple for hashjoin: either by
- *		executing the outer plan node in the first pass, or from
- *		the temp files for the hashjoin batches.
- *
- * Returns a null slot if no more outer tuples (within the current batch).
- *
- * On success, the tuple's hash value is stored at *hashvalue --- this is
- * either originally computed, or re-read from the temp file.
- */
-static TupleTableSlot *
-ExecHashJoinOuterGetTuple(PlanState *outerNode,
-						  HashJoinState *hjstate,
-						  uint32 *hashvalue)
-{
-	HashJoinTable hashtable = hjstate->hj_HashTable;
-	int			curbatch = hashtable->curbatch;
-	TupleTableSlot *slot;
-
-	if (curbatch == 0)			/* if it is the first pass */
-	{
-		/*
-		 * Check to see if first outer tuple was already fetched by
-		 * ExecHashJoin() and not used yet.
-		 */
-		slot = hjstate->hj_FirstOuterTupleSlot;
-		if (!TupIsNull(slot))
-			hjstate->hj_FirstOuterTupleSlot = NULL;
-		else
-			slot = ExecProcNode(outerNode);
-
-		while (!TupIsNull(slot))
-		{
-			/*
-			 * We have to compute the tuple's hash value.
-			 */
-			ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
-
-			econtext->ecxt_outertuple = slot;
-			if (ExecHashGetHashValue(hashtable, econtext,
-									 hjstate->hj_OuterHashKeys,
-									 true,		/* outer tuple */
-									 HJ_FILL_OUTER(hjstate),
-									 hashvalue))
-			{
-				/* remember outer relation is not empty for possible rescan */
-				hjstate->hj_OuterNotEmpty = true;
-
-				return slot;
-			}
-
-			/*
-			 * That tuple couldn't match because of a NULL, so discard it and
-			 * continue with the next one.
-			 */
-			slot = ExecProcNode(outerNode);
-		}
-	}
-	else if (curbatch < hashtable->nbatch)
-	{
-		BufFile    *file = hashtable->outerBatchFile[curbatch];
-
-		/*
-		 * In outer-join cases, we could get here even though the batch file
-		 * is empty.
-		 */
-		if (file == NULL)
-			return NULL;
-
-		slot = ExecHashJoinGetSavedTuple(hjstate,
-										 file,
-										 hashvalue,
-										 hjstate->hj_OuterTupleSlot);
-		if (!TupIsNull(slot))
-			return slot;
-	}
-
-	/* End of this batch */
-	return NULL;
-}
-
-/*
  * ExecHashJoinNewBatch
  *		switch to a new hashjoin batch
  *
@@ -769,8 +635,7 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 					(errcode_for_file_access(),
 				   errmsg("could not rewind hash-join temporary file: %m")));
 
-		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
-												 innerFile,
+		while ((slot = ExecHashJoinGetSavedTuple(innerFile,
 												 &hashvalue,
 												 hjstate->hj_HashTupleSlot)))
 		{
@@ -849,8 +714,7 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
  * itself is stored in the given slot.
  */
 static TupleTableSlot *
-ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
-						  BufFile *file,
+ExecHashJoinGetSavedTuple(BufFile *file,
 						  uint32 *hashvalue,
 						  TupleTableSlot *tupleSlot)
 {
@@ -893,7 +757,6 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 	return ExecStoreMinimalTuple(tuple, tupleSlot, true);
 }
 
-
 void
 ExecReScanHashJoin(HashJoinState *node)
 {
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index fe5c2642d7..1ac95a20fd 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -16,8 +16,9 @@
 
 #include "nodes/execnodes.h"
 
-extern HashState *ExecInitHash(Hash *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHash(HashState *node);
+extern HashState *ExecInitHash(Hash *node, EState *estate, int eflags,
+							   PlanState* parent);
+extern bool pushTupleToHash(TupleTableSlot *slot, HashState *node);
 extern Node *MultiExecHash(HashState *node);
 extern void ExecEndHash(HashState *node);
 extern void ExecReScanHash(HashState *node);
@@ -39,9 +40,13 @@ extern void ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 						  int *bucketno,
 						  int *batchno);
 extern bool ExecScanHashBucket(HashJoinState *hjstate, ExprContext *econtext);
+extern bool ExecScanHashBucketAndPush(HashJoinState *hjstate,
+									  ExprContext *econtext);
 extern void ExecPrepHashTableForUnmatched(HashJoinState *hjstate);
 extern bool ExecScanHashTableForUnmatched(HashJoinState *hjstate,
 							  ExprContext *econtext);
+extern bool ExecScanHashTableForUnmatchedAndPush(HashJoinState *hjstate,
+							  ExprContext *econtext);
 extern void ExecHashTableReset(HashJoinTable hashtable);
 extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
 extern void ExecChooseHashTableSize(double ntuples, int tupwidth, bool useskew,
diff --git a/src/include/executor/nodeHashjoin.h b/src/include/executor/nodeHashjoin.h
index ddc32b1de3..8b3b88917c 100644
--- a/src/include/executor/nodeHashjoin.h
+++ b/src/include/executor/nodeHashjoin.h
@@ -16,13 +16,107 @@
 
 #include "nodes/execnodes.h"
 #include "storage/buffile.h"
+#include "executor/executor.h"
+#include "executor/hashjoin.h"
+#include "access/htup_details.h"
+#include "utils/memutils.h"
 
-extern HashJoinState *ExecInitHashJoin(HashJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHashJoin(HashJoinState *node);
+/* Returns true if doing null-fill on outer relation */
+#define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
+/* Returns true if doing null-fill on inner relation */
+#define HJ_FILL_INNER(hjstate)	((hjstate)->hj_NullOuterTupleSlot != NULL)
+
+extern HashJoinState *ExecInitHashJoin(HashJoin *node, EState *estate,
+									   int eflags, PlanState *parent);
+extern bool pushTupleToHashJoinFromInner(TupleTableSlot *slot,
+								  HashJoinState *node);
+extern bool pushTupleToHashJoinFromOuter(TupleTableSlot *slot,
+										 HashJoinState *node);
 extern void ExecEndHashJoin(HashJoinState *node);
 extern void ExecReScanHashJoin(HashJoinState *node);
 
 extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
 					  BufFile **fileptr);
 
-#endif   /* NODEHASHJOIN_H */
+/* Inline function declarations and implementations */
+#pragma GCC diagnostic warning "-Winline"
+static inline bool CheckOtherQualAndPush(HashJoinState *node);
+static inline bool PushUnmatched(HashJoinState *node);
+static inline bool CheckJoinQualAndPush(HashJoinState *node);
+
+/*
+ * Everything is ready for checking otherqual and projecting; do that,
+ * and push the result.
+ *
+ * Returns true if parent accepts more tuples, false otherwise
+ */
+static inline bool CheckOtherQualAndPush(HashJoinState *node)
+{
+	ExprContext *econtext = node->js.ps.ps_ExprContext;
+	List *otherqual = node->js.ps.qual;
+	TupleTableSlot *slot;
+
+	if (otherqual == NIL ||
+		ExecQual(otherqual, econtext, false))
+	{
+		slot = ExecProject(node->js.ps.ps_ProjInfo);
+		return pushTuple(slot, node->js.ps.parent, (PlanState *) node);
+	}
+	else
+		InstrCountFiltered2(node, 1);
+	return true;
+}
+
+/*
+ * Push an inner tuple that had no match; ExecScanHashTableForUnmatchedAndPush
+ * has prepared the state needed for ExecQual.
+ *
+ * Returns true if parent accepts more tuples, false otherwise.
+ */
+static inline bool PushUnmatched(HashJoinState *node)
+{
+	ExprContext *econtext = node->js.ps.ps_ExprContext;
+	/*
+	 * Reset per-tuple memory context to free any expression evaluation
+	 * storage.
+	 */
+	ResetExprContext(econtext);
+
+	/*
+	 * Generate a fake join tuple with nulls for the outer tuple,
+	 * and return it if it passes the non-join quals.
+	 */
+	econtext->ecxt_outertuple = node->hj_NullOuterTupleSlot;
+	return CheckOtherQualAndPush(node);
+}
+
+/*
+ * We have found an inner tuple whose hashed quals match the current outer
+ * tuple. Now check the non-hashed join quals and other quals, then project
+ * and push the result.
+ *
+ * State for ExecQual was already set by ExecScanHashBucketAndPush and before.
+ * Returns true if parent accepts more tuples, false otherwise.
+ */
+static inline bool CheckJoinQualAndPush(HashJoinState *node)
+{
+	List	   *joinqual = node->js.joinqual;
+	ExprContext *econtext = node->js.ps.ps_ExprContext;
+
+	/*
+	 * Only the joinquals determine tuple match status, but all
+	 * quals must pass to actually return the tuple.
+	 */
+	if (joinqual == NIL || ExecQual(joinqual, econtext, false))
+	{
+		node->hj_MatchedOuter = true;
+		HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+		return CheckOtherQualAndPush(node);
+	}
+	else
+		InstrCountFiltered1(node, 1);
+
+	return true;
+}
+
+#endif	 /* NODEHASHJOIN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index da7fd9c7ac..abbe67ba0c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2145,6 +2145,8 @@ typedef struct HashState
 	HashJoinTable hashtable;	/* hash table for the hashjoin */
 	List	   *hashkeys;		/* list of ExprState nodes */
 	/* hashkeys is same as parent's hj_InnerHashKeys */
+	/* on the first push we must build the hashtable */
+	bool first_time_through;
 } HashState;
 
 /* ----------------
-- 
2.11.0

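Reading the Limit patch below is easier with the push protocol in mind, so here is a minimal standalone sketch of the control-flow inversion (toy names such as LimitSink, push_to_limit and scan_and_push are illustrative, not the executor's actual types): the consumer returns false to apply backpressure, so a LIMIT can stop its producer early without the old state machine.

```c
#include <stdbool.h>
#include <stdio.h>

/* Toy consumer mirroring pushTupleToLimit: accept tuples until the
 * LIMIT/OFFSET window is exhausted, then return false so the producer
 * stops early instead of scanning to the end. */
typedef struct LimitSink {
    int offset;   /* tuples to skip */
    int count;    /* tuples to emit */
    int position; /* tuples seen so far */
    int emitted;  /* tuples actually pushed upward */
} LimitSink;

static bool push_to_limit(int tuple, LimitSink *s)
{
    if (++s->position <= s->offset)
        return true;            /* still before the window: keep pushing */
    printf("emit %d\n", tuple); /* stand-in for pushing to the parent */
    s->emitted++;
    /* return false once the window is full: producer must stop */
    return s->position - s->offset < s->count;
}

/* Toy producer mirroring a scan node: pushes until the sink refuses.
 * Returns how many tuples it actually produced. */
static int scan_and_push(int ntuples, LimitSink *s)
{
    int produced = 0;
    for (int i = 0; i < ntuples; i++)
    {
        produced++;
        if (!push_to_limit(i, s))
            break;              /* backpressure: stop producing */
    }
    return produced;
}
```

With offset = 2 and count = 3, the producer stops after generating only five tuples out of a hundred, which is exactly the early-exit behavior the bool result of pushTuple is meant to buy.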
0006-Reversed-Limit-implementation.patchtext/x-diffDownload
From 1571b0d239d61f74b311c61dad1a68ce2c048af2 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Sat, 11 Mar 2017 02:33:47 +0300
Subject: [PATCH 6/8] Reversed Limit implementation.

---
 src/backend/executor/execProcnode.c |  14 ++-
 src/backend/executor/nodeLimit.c    | 245 +++++++++---------------------------
 src/include/executor/nodeLimit.h    |   5 +-
 src/include/nodes/execnodes.h       |  10 +-
 4 files changed, 75 insertions(+), 199 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 88e14d144a..108659fafb 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -172,6 +172,12 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 			result = (PlanState *) ExecInitHash((Hash *) node,
 												estate, eflags, parent);
 			break;
+
+		case T_Limit:
+			result = (PlanState *) ExecInitLimit((Limit *) node,
+												 estate, eflags, parent);
+			break;
+
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
@@ -247,7 +253,9 @@ pushTuple(TupleTableSlot *slot, PlanState *node, PlanState *pusher)
 	/* does push come from the outer side? */
 	push_from_outer = outerPlanState(node) == pusher;
 
-	if (nodeTag(node) == T_HashState)
+	if (nodeTag(node) == T_LimitState)
+		return pushTupleToLimit(slot, (LimitState *) node);
+	else if (nodeTag(node) == T_HashState)
 		return pushTupleToHash(slot, (HashState *) node);
 
 	else if (nodeTag(node) == T_HashJoinState && push_from_outer)
@@ -319,6 +327,10 @@ ExecEndNode(PlanState *node)
 			ExecEndHash((HashState *) node);
 			break;
 
+		case T_LimitState:
+			ExecEndLimit((LimitState *) node);
+			break;
+
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index bcacbfc13b..6e1ec6f77e 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -28,199 +28,67 @@
 static void recompute_limits(LimitState *node);
 static void pass_down_bound(LimitState *node, PlanState *child_node);
 
-
-/* ----------------------------------------------------------------
- *		ExecLimit
- *
- *		This is a very simple node which just performs LIMIT/OFFSET
- *		filtering on the stream of tuples returned by a subplan.
- * ----------------------------------------------------------------
- */
-TupleTableSlot *				/* return: a tuple or NULL */
-ExecLimit(LimitState *node)
+bool
+pushTupleToLimit(TupleTableSlot *slot, LimitState *node)
 {
-	ScanDirection direction;
-	TupleTableSlot *slot;
-	PlanState  *outerPlan;
+	bool parent_accepts_tuples;
+	bool limit_accepts_tuples;
+	/* last tuple in the window just pushed */
+	bool last_tuple_pushed;
 
 	/*
-	 * get information from the node
+	 * Backward direction is not supported at the moment
 	 */
-	direction = node->ps.state->es_direction;
-	outerPlan = outerPlanState(node);
+	Assert(ScanDirectionIsForward(node->ps.state->es_direction));
+	/* guard against calling pushTupleToLimit after it returned false */
+	Assert(node->lstate != LIMIT_DONE);
 
-	/*
-	 * The main logic is a simple state machine.
-	 */
-	switch (node->lstate)
+	if (TupIsNull(slot))
 	{
-		case LIMIT_INITIAL:
-
-			/*
-			 * First call for this node, so compute limit/offset. (We can't do
-			 * this any earlier, because parameters from upper nodes will not
-			 * be set during ExecInitLimit.)  This also sets position = 0 and
-			 * changes the state to LIMIT_RESCAN.
-			 */
-			recompute_limits(node);
-
-			/* FALL THRU */
-
-		case LIMIT_RESCAN:
-
-			/*
-			 * If backwards scan, just return NULL without changing state.
-			 */
-			if (!ScanDirectionIsForward(direction))
-				return NULL;
-
-			/*
-			 * Check for empty window; if so, treat like empty subplan.
-			 */
-			if (node->count <= 0 && !node->noCount)
-			{
-				node->lstate = LIMIT_EMPTY;
-				return NULL;
-			}
-
-			/*
-			 * Fetch rows from subplan until we reach position > offset.
-			 */
-			for (;;)
-			{
-				slot = ExecProcNode(outerPlan);
-				if (TupIsNull(slot))
-				{
-					/*
-					 * The subplan returns too few tuples for us to produce
-					 * any output at all.
-					 */
-					node->lstate = LIMIT_EMPTY;
-					return NULL;
-				}
-				node->subSlot = slot;
-				if (++node->position > node->offset)
-					break;
-			}
-
-			/*
-			 * Okay, we have the first tuple of the window.
-			 */
-			node->lstate = LIMIT_INWINDOW;
-			break;
-
-		case LIMIT_EMPTY:
-
-			/*
-			 * The subplan is known to return no tuples (or not more than
-			 * OFFSET tuples, in general).  So we return no tuples.
-			 */
-			return NULL;
-
-		case LIMIT_INWINDOW:
-			if (ScanDirectionIsForward(direction))
-			{
-				/*
-				 * Forwards scan, so check for stepping off end of window. If
-				 * we are at the end of the window, return NULL without
-				 * advancing the subplan or the position variable; but change
-				 * the state machine state to record having done so.
-				 */
-				if (!node->noCount &&
-					node->position - node->offset >= node->count)
-				{
-					node->lstate = LIMIT_WINDOWEND;
-					return NULL;
-				}
-
-				/*
-				 * Get next tuple from subplan, if any.
-				 */
-				slot = ExecProcNode(outerPlan);
-				if (TupIsNull(slot))
-				{
-					node->lstate = LIMIT_SUBPLANEOF;
-					return NULL;
-				}
-				node->subSlot = slot;
-				node->position++;
-			}
-			else
-			{
-				/*
-				 * Backwards scan, so check for stepping off start of window.
-				 * As above, change only state-machine status if so.
-				 */
-				if (node->position <= node->offset + 1)
-				{
-					node->lstate = LIMIT_WINDOWSTART;
-					return NULL;
-				}
-
-				/*
-				 * Get previous tuple from subplan; there should be one!
-				 */
-				slot = ExecProcNode(outerPlan);
-				if (TupIsNull(slot))
-					elog(ERROR, "LIMIT subplan failed to run backwards");
-				node->subSlot = slot;
-				node->position--;
-			}
-			break;
-
-		case LIMIT_SUBPLANEOF:
-			if (ScanDirectionIsForward(direction))
-				return NULL;
-
-			/*
-			 * Backing up from subplan EOF, so re-fetch previous tuple; there
-			 * should be one!  Note previous tuple must be in window.
-			 */
-			slot = ExecProcNode(outerPlan);
-			if (TupIsNull(slot))
-				elog(ERROR, "LIMIT subplan failed to run backwards");
-			node->subSlot = slot;
-			node->lstate = LIMIT_INWINDOW;
-			/* position does not change 'cause we didn't advance it before */
-			break;
-
-		case LIMIT_WINDOWEND:
-			if (ScanDirectionIsForward(direction))
-				return NULL;
-
-			/*
-			 * Backing up from window end: simply re-return the last tuple
-			 * fetched from the subplan.
-			 */
-			slot = node->subSlot;
-			node->lstate = LIMIT_INWINDOW;
-			/* position does not change 'cause we didn't advance it before */
-			break;
-
-		case LIMIT_WINDOWSTART:
-			if (!ScanDirectionIsForward(direction))
-				return NULL;
-
-			/*
-			 * Advancing after having backed off window start: simply
-			 * re-return the last tuple fetched from the subplan.
-			 */
-			slot = node->subSlot;
-			node->lstate = LIMIT_INWINDOW;
-			/* position does not change 'cause we didn't change it before */
-			break;
-
-		default:
-			elog(ERROR, "impossible LIMIT state: %d",
-				 (int) node->lstate);
-			slot = NULL;		/* keep compiler quiet */
-			break;
+		/* NULL came from below, so this is the end of input anyway */
+		node->lstate = LIMIT_DONE;
+		pushTuple(slot, node->ps.parent, (PlanState *) node);
+		return false;
 	}
 
-	/* Return the current tuple */
-	Assert(!TupIsNull(slot));
+	if (node->lstate == LIMIT_INITIAL)
+	{
+		/*
+		 * First call for this node, so compute limit/offset. (We can't do
+		 * this any earlier, because parameters from upper nodes will not
+		 * be set during ExecInitLimit.) This also sets position = 0.
+		 */
+		recompute_limits(node);
+
+		/*
+		 * Check for empty window; if so, treat like empty subplan.
+		 */
+		if (!node->noCount && node->count <= 0)
+		{
+			node->lstate = LIMIT_DONE;
+			pushTuple(NULL, node->ps.parent, (PlanState *) node);
+			return false;
+		}
+
+		node->lstate = LIMIT_ACTIVE;
+	}
 
-	return slot;
+	if (++node->position <= node->offset)
+	{
+		/* we are not inside the window yet, wait for the next tuple */
+		return true;
+	}
+	/* Now we are sure that we are inside the window and this tuple has
+	 * to be pushed */
+	parent_accepts_tuples = pushTuple(slot, node->ps.parent,
+									  (PlanState *) node);
+	/* Check whether we have just pushed the last tuple of the window */
+	last_tuple_pushed = !node->noCount &&
+		node->position - node->offset >= node->count;
+	limit_accepts_tuples = parent_accepts_tuples && !last_tuple_pushed;
+	if (!limit_accepts_tuples)
+		node->lstate = LIMIT_DONE;
+	return limit_accepts_tuples;
 }
 
 /*
@@ -290,9 +158,6 @@ recompute_limits(LimitState *node)
 	node->position = 0;
 	node->subSlot = NULL;
 
-	/* Set state-machine state */
-	node->lstate = LIMIT_RESCAN;
-
 	/* Notify child node about limit, if useful */
 	pass_down_bound(node, outerPlanState(node));
 }
@@ -361,7 +226,7 @@ pass_down_bound(LimitState *node, PlanState *child_node)
  * ----------------------------------------------------------------
  */
 LimitState *
-ExecInitLimit(Limit *node, EState *estate, int eflags)
+ExecInitLimit(Limit *node, EState *estate, int eflags, PlanState *parent)
 {
 	LimitState *limitstate;
 	Plan	   *outerPlan;
@@ -375,6 +240,7 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	limitstate = makeNode(LimitState);
 	limitstate->ps.plan = (Plan *) node;
 	limitstate->ps.state = estate;
+	limitstate->ps.parent = parent;
 
 	limitstate->lstate = LIMIT_INITIAL;
 
@@ -403,7 +269,8 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	 * then initialize outer plan
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
+	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags,
+											  (PlanState *) limitstate);
 
 	/*
 	 * limit nodes do no projections, so initialize projection info for this
diff --git a/src/include/executor/nodeLimit.h b/src/include/executor/nodeLimit.h
index 6e4084b46d..348a0352fc 100644
--- a/src/include/executor/nodeLimit.h
+++ b/src/include/executor/nodeLimit.h
@@ -16,8 +16,9 @@
 
 #include "nodes/execnodes.h"
 
-extern LimitState *ExecInitLimit(Limit *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLimit(LimitState *node);
+extern LimitState *ExecInitLimit(Limit *node, EState *estate, int eflags,
+								 PlanState *parent);
+extern bool pushTupleToLimit(TupleTableSlot *slot, LimitState *node);
 extern void ExecEndLimit(LimitState *node);
 extern void ExecReScanLimit(LimitState *node);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index abbe67ba0c..056db943b0 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2208,13 +2208,9 @@ typedef struct LockRowsState
  */
 typedef enum
 {
-	LIMIT_INITIAL,				/* initial state for LIMIT node */
-	LIMIT_RESCAN,				/* rescan after recomputing parameters */
-	LIMIT_EMPTY,				/* there are no returnable rows */
-	LIMIT_INWINDOW,				/* have returned a row in the window */
-	LIMIT_SUBPLANEOF,			/* at EOF of subplan (within window) */
-	LIMIT_WINDOWEND,			/* stepped off end of window */
-	LIMIT_WINDOWSTART			/* stepped off beginning of window */
+	LIMIT_INITIAL,		/* initial state for LIMIT node */
+	LIMIT_ACTIVE,		/* waiting for tuples */
+	LIMIT_DONE,			/* pushed all needed tuples */
 } LimitStateCond;
 
 typedef struct LimitState
-- 
2.11.0

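The hashed Agg patch below drives its output through a 'foreach' added to simplehash.h whose callback can abort the scan through a bool accumulator. A minimal standalone sketch of that short-circuiting iteration pattern, with toy types (Entry, Table, sum_cb) that are illustrative rather than the real simplehash macros:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy entry and fixed-size table, standing in for the simplehash table. */
typedef struct Entry { int value; bool used; } Entry;
typedef struct Table { Entry slots[8]; } Table;

/* Callback returns false to stop iteration, playing the role of the
 * bool accumulator used to propagate "parent refused a tuple". */
typedef bool (*foreach_cb)(Entry *entry, void *arg);

/* foreach: visit every occupied slot, short-circuiting as soon as the
 * callback reports that no more output is wanted. Returns true only if
 * the whole table was visited. */
static bool table_foreach(Table *tb, foreach_cb cb, void *arg)
{
    for (size_t i = 0; i < 8; i++)
    {
        if (!tb->slots[i].used)
            continue;
        if (!cb(&tb->slots[i], arg))
            return false;       /* consumer stopped accepting */
    }
    return true;
}

/* Example callback: sum entry values, but stop once 'limit' entries
 * have been pushed, imitating a parent that refuses further tuples. */
typedef struct SumState { int sum; int pushed; int limit; } SumState;

static bool sum_cb(Entry *entry, void *arg)
{
    SumState *st = arg;
    st->sum += entry->value;
    return ++st->pushed < st->limit;
}
```

The point of returning the accumulator out of the loop is that a refusal from the parent propagates immediately, instead of the node draining the whole hash table first.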
0007-Reversed-hashed-Agg-implementation.patchtext/x-diffDownload
From 2288085916540926069ddb0a6d5499327450dbd4 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Tue, 14 Mar 2017 15:26:55 +0300
Subject: [PATCH 7/8] Reversed hashed Agg implementation.

Only hashed Agg is reversed. The part that puts tuples into the hashtable
is practically the same, with the hashtable lookups inlined.

To iterate over the hashtable efficiently, a 'foreach' method was added to
simplehash.h. As in SeqScan or HashJoin, the goal is to have a single loop
iterating over the tuples and pushing them. The current implementation
allows only one 'foreach' type per hashtable; this obviously should be
changed if needed.
---
 src/backend/executor/execGrouping.c |  75 ----
 src/backend/executor/execProcnode.c |  13 +
 src/backend/executor/nodeAgg.c      | 717 ++++++++++++++++++++++++++----------
 src/include/executor/executor.h     |  98 ++++-
 src/include/executor/nodeAgg.h      |   5 +-
 src/include/lib/simplehash.h        |  60 +++
 6 files changed, 683 insertions(+), 285 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 4b1f634e21..7d5ae4aa04 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -51,81 +51,6 @@ static int	TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tup
  *****************************************************************************/
 
 /*
- * execTuplesMatch
- *		Return true if two tuples match in all the indicated fields.
- *
- * This actually implements SQL's notion of "not distinct".  Two nulls
- * match, a null and a not-null don't match.
- *
- * slot1, slot2: the tuples to compare (must have same columns!)
- * numCols: the number of attributes to be examined
- * matchColIdx: array of attribute column numbers
- * eqFunctions: array of fmgr lookup info for the equality functions to use
- * evalContext: short-term memory context for executing the functions
- *
- * NB: evalContext is reset each time!
- */
-bool
-execTuplesMatch(TupleTableSlot *slot1,
-				TupleTableSlot *slot2,
-				int numCols,
-				AttrNumber *matchColIdx,
-				FmgrInfo *eqfunctions,
-				MemoryContext evalContext)
-{
-	MemoryContext oldContext;
-	bool		result;
-	int			i;
-
-	/* Reset and switch into the temp context. */
-	MemoryContextReset(evalContext);
-	oldContext = MemoryContextSwitchTo(evalContext);
-
-	/*
-	 * We cannot report a match without checking all the fields, but we can
-	 * report a non-match as soon as we find unequal fields.  So, start
-	 * comparing at the last field (least significant sort key). That's the
-	 * most likely to be different if we are dealing with sorted input.
-	 */
-	result = true;
-
-	for (i = numCols; --i >= 0;)
-	{
-		AttrNumber	att = matchColIdx[i];
-		Datum		attr1,
-					attr2;
-		bool		isNull1,
-					isNull2;
-
-		attr1 = slot_getattr(slot1, att, &isNull1);
-
-		attr2 = slot_getattr(slot2, att, &isNull2);
-
-		if (isNull1 != isNull2)
-		{
-			result = false;		/* one null and one not; they aren't equal */
-			break;
-		}
-
-		if (isNull1)
-			continue;			/* both are null, treat as equal */
-
-		/* Apply the type-specific equality function */
-
-		if (!DatumGetBool(FunctionCall2(&eqfunctions[i],
-										attr1, attr2)))
-		{
-			result = false;		/* they aren't equal */
-			break;
-		}
-	}
-
-	MemoryContextSwitchTo(oldContext);
-
-	return result;
-}
-
-/*
  * execTuplesUnequal
  *		Return true if two tuples are definitely unequal in the indicated
  *		fields.
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 108659fafb..1aca5f0d75 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -168,6 +168,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 		/*
 		 * materialization nodes
 		 */
+		case T_Agg:
+			result = (PlanState *) ExecInitAgg((Agg *) node,
+											   estate, eflags, parent);
+			break;
+
 		case T_Hash:
 			result = (PlanState *) ExecInitHash((Hash *) node,
 												estate, eflags, parent);
@@ -255,6 +260,10 @@ pushTuple(TupleTableSlot *slot, PlanState *node, PlanState *pusher)
 
 	if (nodeTag(node) == T_LimitState)
 		return pushTupleToLimit(slot, (LimitState *) node);
+
+	else if (nodeTag(node) == T_AggState)
+		return pushTupleToAgg(slot, (AggState *) node);
+
 	else if (nodeTag(node) == T_HashState)
 		return pushTupleToHash(slot, (HashState *) node);
 
@@ -323,6 +332,10 @@ ExecEndNode(PlanState *node)
 		/*
 		 * materialization nodes
 		 */
+		case T_AggState:
+			ExecEndAgg((AggState *) node);
+			break;
+
 		case T_HashState:
 			ExecEndHash((HashState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index fa19358d19..b90ee28425 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -153,6 +153,8 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/parallel.h"
+#include "access/hash.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_proc.h"
@@ -440,7 +442,7 @@ typedef struct AggStatePerPhaseData
 	Sort	   *sortnode;		/* Sort node for input ordering for phase */
 }	AggStatePerPhaseData;
 
-
+#pragma GCC diagnostic warning "-Winline"
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
 static void initialize_aggregates(AggState *aggstate,
@@ -460,10 +462,10 @@ static void process_ordered_aggregate_single(AggState *aggstate,
 static void process_ordered_aggregate_multi(AggState *aggstate,
 								AggStatePerTrans pertrans,
 								AggStatePerGroup pergroupstate);
-static void finalize_aggregate(AggState *aggstate,
-				   AggStatePerAgg peragg,
-				   AggStatePerGroup pergroupstate,
-				   Datum *resultVal, bool *resultIsNull);
+static inline void finalize_aggregate(AggState *aggstate,
+									  AggStatePerAgg peragg,
+									  AggStatePerGroup pergroupstate,
+									  Datum *resultVal, bool *resultIsNull);
 static void finalize_partialaggregate(AggState *aggstate,
 						  AggStatePerAgg peragg,
 						  AggStatePerGroup pergroupstate,
@@ -471,19 +473,21 @@ static void finalize_partialaggregate(AggState *aggstate,
 static void prepare_projection_slot(AggState *aggstate,
 						TupleTableSlot *slot,
 						int currentSet);
-static void finalize_aggregates(AggState *aggstate,
-					AggStatePerAgg peragg,
-					AggStatePerGroup pergroup,
-					int currentSet);
+static inline void finalize_aggregates(AggState *aggstate,
+									   AggStatePerAgg peraggs,
+									   AggStatePerGroup pergroup,
+									   int currentSet);
 static TupleTableSlot *project_aggregates(AggState *aggstate);
+static inline bool project_aggregates_and_push(AggState *aggstate);
+static inline bool AggPushHashEntry(TupleHashEntry entry, void *astate);
 static Bitmapset *find_unaggregated_cols(AggState *aggstate);
 static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
 static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate,
-				  TupleTableSlot *inputslot);
+static inline TupleHashEntryData *lookup_hash_entry(AggState *aggstate,
+													TupleTableSlot *inputslot);
 static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
-static void agg_fill_hash_table(AggState *aggstate);
-static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static void agg_puttup_hash_table(AggState *aggstate, TupleTableSlot *outerslot);
+static void agg_push_hash_table(AggState *aggstate);
 static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
 static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
 						  AggState *aggstate, EState *estate,
@@ -498,6 +502,39 @@ static int find_compatible_pertrans(AggState *aggstate, Aggref *newagg,
 						 Oid aggserialfn, Oid aggdeserialfn,
 						 Datum initValue, bool initValueIsNull,
 						 List *transnos);
+/*
+ * We use our own hash table instead of defined in execGrouping.c, see notes
+ * below.
+ */
+/* define parameters necessary to generate the tuple hash table interface */
+#define SH_PREFIX aggtuplehash
+#define SH_ELEMENT_TYPE TupleHashEntryData
+#define SH_KEY_TYPE MinimalTuple
+#define SH_SCOPE static inline
+#define SH_FOREACH_ON
+#define SH_FOREACH_ACC_TYPE bool
+#define SH_DECLARE
+#include "lib/simplehash.h"
+static inline bool inline_and(bool old, bool new);
+
+/*
+ * And our own copies of funcs from execGrouping.c
+ */
+static TupleHashTable BuildAggTupleHashTable(int numCols, AttrNumber *keyColIdx,
+											 FmgrInfo *eqfunctions,
+											 FmgrInfo *hashfunctions,
+											 long nbuckets, Size additionalsize,
+											 MemoryContext tablecxt,
+											 MemoryContext tempcxt,
+											 bool use_variable_hash_iv);
+static inline TupleHashEntry LookupAggTupleHashEntry(TupleHashTable hashtable,
+													 TupleTableSlot *slot,
+													 bool *isnew);
+static inline uint32 AggTupleHashTableHash(struct aggtuplehash_hash *tb,
+										   const MinimalTuple tuple);
+static inline int AggTupleHashTableMatch(struct aggtuplehash_hash *tb,
+										 const MinimalTuple tuple1,
+										 const MinimalTuple tuple2);
 
 
 /*
@@ -1573,7 +1610,6 @@ finalize_aggregates(AggState *aggstate,
 	Datum	   *aggvalues = econtext->ecxt_aggvalues;
 	bool	   *aggnulls = econtext->ecxt_aggnulls;
 	int			aggno;
-	int			transno;
 
 	Assert(currentSet == 0 ||
 		   ((Agg *) aggstate->ss.ps.plan)->aggstrategy != AGG_HASHED);
@@ -1581,32 +1617,6 @@ finalize_aggregates(AggState *aggstate,
 	aggstate->current_set = currentSet;
 
 	/*
-	 * If there were any DISTINCT and/or ORDER BY aggregates, sort their
-	 * inputs and run the transition functions.
-	 */
-	for (transno = 0; transno < aggstate->numtrans; transno++)
-	{
-		AggStatePerTrans pertrans = &aggstate->pertrans[transno];
-		AggStatePerGroup pergroupstate;
-
-		pergroupstate = &pergroup[transno + (currentSet * (aggstate->numtrans))];
-
-		if (pertrans->numSortCols > 0)
-		{
-			Assert(((Agg *) aggstate->ss.ps.plan)->aggstrategy != AGG_HASHED);
-
-			if (pertrans->numInputs == 1)
-				process_ordered_aggregate_single(aggstate,
-												 pertrans,
-												 pergroupstate);
-			else
-				process_ordered_aggregate_multi(aggstate,
-												pertrans,
-												pergroupstate);
-		}
-	}
-
-	/*
 	 * Run the final functions.
 	 */
 	for (aggno = 0; aggno < aggstate->numaggs; aggno++)
@@ -1617,12 +1627,8 @@ finalize_aggregates(AggState *aggstate,
 
 		pergroupstate = &pergroup[transno + (currentSet * (aggstate->numtrans))];
 
-		if (DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit))
-			finalize_partialaggregate(aggstate, peragg, pergroupstate,
-									  &aggvalues[aggno], &aggnulls[aggno]);
-		else
-			finalize_aggregate(aggstate, peragg, pergroupstate,
-							   &aggvalues[aggno], &aggnulls[aggno]);
+		finalize_aggregate(aggstate, peragg, pergroupstate,
+						   &aggvalues[aggno], &aggnulls[aggno]);
 	}
 }
 
@@ -1654,6 +1660,106 @@ project_aggregates(AggState *aggstate)
 }
 
 /*
+ * Project the result of a group (whose aggs have already been calculated by
+ * finalize_aggregates), and push all tuples. Returns true if all tuples
+ * were pushed, false if the parent doesn't want to accept tuples anymore.
+ */
+static inline bool
+project_aggregates_and_push(AggState *aggstate)
+{
+	ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
+	PlanState *parent = aggstate->ss.ps.parent;
+
+	/*
+	 * Check the qual (HAVING clause); if the group does not match, ignore it.
+	 */
+	if (ExecQual(aggstate->ss.ps.qual, econtext, false))
+	{
+		/*
+		 * Form and return or store a projection tuple using the aggregate
+		 * results and the representative input tuple.
+		 */
+		TupleTableSlot *slot;
+
+		slot = ExecProject(aggstate->ss.ps.ps_ProjInfo);
+		return pushTuple(slot, parent, (PlanState *) aggstate);
+
+	}
+	else
+		InstrCountFiltered1(aggstate, 1);
+
+	return true;
+}
+
+/*
+ * Finalize one TupleHashEntry, project the result and push it. Returns true
+ * if the tuple was pushed, false if the parent doesn't want to accept
+ * tuples anymore.
+ */
+static inline bool
+AggPushHashEntry(TupleHashEntry entry, void *astate)
+{
+	AggState *aggstate = (AggState *) astate;
+	ExprContext *econtext;
+	AggStatePerAgg peragg;
+	AggStatePerGroup pergroup;
+	TupleTableSlot *firstSlot;
+	TupleTableSlot *hashslot;
+	int i;
+
+	/*
+	 * get state info from node
+	 */
+	/* econtext is the per-output-tuple expression context */
+	econtext = aggstate->ss.ps.ps_ExprContext;
+	peragg = aggstate->peragg;
+	firstSlot = aggstate->ss.ss_ScanTupleSlot;
+	hashslot = aggstate->hashslot;
+
+	/*
+	 * Clear the per-output-tuple context for each group
+	 *
+	 * We intentionally don't use ReScanExprContext here; if any aggs have
+	 * registered shutdown callbacks, they mustn't be called yet, since we
+	 * might not be done with that agg.
+	 */
+	ResetExprContext(econtext);
+
+	/*
+	 * Store the copied first input tuple in the tuple table slot reserved
+	 * for it, so that it can be used in ExecProject.
+	 */
+	ExecStoreMinimalTuple(entry->firstTuple, hashslot, false);
+	slot_getallattrs(hashslot);
+
+	ExecClearTuple(firstSlot);
+	memset(firstSlot->tts_isnull, true,
+		   firstSlot->tts_tupleDescriptor->natts * sizeof(bool));
+
+	for (i = 0; i < aggstate->numhashGrpCols; i++)
+	{
+		int			varNumber = aggstate->hashGrpColIdxInput[i] - 1;
+
+		firstSlot->tts_values[varNumber] = hashslot->tts_values[i];
+		firstSlot->tts_isnull[varNumber] = hashslot->tts_isnull[i];
+	}
+	ExecStoreVirtualTuple(firstSlot);
+
+	pergroup = (AggStatePerGroup) entry->additional;
+
+	finalize_aggregates(aggstate, peragg, pergroup, 0);
+
+	/*
+	 * Use the representative input tuple for any references to
+	 * non-aggregated input columns in the qual and tlist.
+	 */
+	econtext->ecxt_outertuple = firstSlot;
+
+	return project_aggregates_and_push(aggstate);
+}
+
+
+/*
  * find_unaggregated_cols
  *	  Construct a bitmapset of the column numbers of un-aggregated Vars
  *	  appearing in our targetlist and qual (HAVING clause)
@@ -1719,12 +1825,12 @@ build_hash_table(AggState *aggstate)
 
 	additionalsize = aggstate->numaggs * sizeof(AggStatePerGroupData);
 
-	aggstate->hashtable = BuildTupleHashTable(node->numCols,
-											  aggstate->hashGrpColIdxHash,
-											  aggstate->phase->eqfunctions,
-											  aggstate->hashfunctions,
-											  node->numGroups,
-											  additionalsize,
+	aggstate->hashtable = BuildAggTupleHashTable(node->numCols,
+												 aggstate->hashGrpColIdxHash,
+												 aggstate->phase->eqfunctions,
+												 aggstate->hashfunctions,
+												 node->numGroups,
+												 additionalsize,
 							 aggstate->aggcontexts[0]->ecxt_per_tuple_memory,
 											  tmpmem,
 								  DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
@@ -1845,7 +1951,7 @@ hash_agg_entry_size(int numAggs)
  *
  * When called, CurrentMemoryContext should be the per-query context.
  */
-static TupleHashEntryData *
+static inline TupleHashEntryData *
 lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
 {
 	TupleTableSlot *hashslot = aggstate->hashslot;
@@ -1867,7 +1973,7 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
 	ExecStoreVirtualTuple(hashslot);
 
 	/* find or create the hashtable entry using the filtered tuple */
-	entry = LookupTupleHashEntry(aggstate->hashtable, hashslot, &isnew);
+	entry = LookupAggTupleHashEntry(aggstate->hashtable, hashslot, &isnew);
 
 	if (isnew)
 	{
@@ -1883,43 +1989,38 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
 }
 
 /*
- * ExecAgg -
- *
- *	  ExecAgg receives tuples from its outer subplan and aggregates over
- *	  the appropriate attribute for each aggregate function use (Aggref
- *	  node) appearing in the targetlist or qual of the node.  The number
- *	  of tuples to aggregate over depends on whether grouped or plain
- *	  aggregation is selected.  In grouped aggregation, we produce a result
- *	  row for each group; in plain aggregation there's a single result row
- *	  for the whole query.  In either case, the value of each aggregate is
- *	  stored in the expression context to be used when ExecProject evaluates
- *	  the result tuple.
+ * pushTupleToAgg -
+ *
+ *	  pushTupleToAgg receives tuples from its outer subplan and aggregates
+ *	  over the appropriate attribute for each aggregate function use (Aggref
+ *	  node) appearing in the targetlist or qual of the node.  The number of
+ *	  tuples to aggregate over depends on whether grouped or plain aggregation
+ *	  is selected.  In grouped aggregation, we produce a result row for each
+ *	  group; in plain aggregation there's a single result row for the whole
+ *	  query.  In either case, the value of each aggregate is stored in the
+ *	  expression context to be used when ExecProject evaluates the result
+ *	  tuple.
  */
-TupleTableSlot *
-ExecAgg(AggState *node)
+bool
+pushTupleToAgg(TupleTableSlot *slot, AggState *node)
 {
-	TupleTableSlot *result;
+	/* Only AGG_HASHED is supported at the moment */
+	Assert(node->phase->aggnode->aggstrategy == AGG_HASHED);
+	/* AGGSPLIT is not supported at the moment */
+	Assert(node->aggsplit == AGGSPLIT_SIMPLE);
+	Assert(!node->agg_done);
 
-	if (!node->agg_done)
+	if (!TupIsNull(slot))
 	{
-		/* Dispatch based on strategy */
-		switch (node->phase->aggnode->aggstrategy)
-		{
-			case AGG_HASHED:
-				if (!node->table_filled)
-					agg_fill_hash_table(node);
-				result = agg_retrieve_hash_table(node);
-				break;
-			default:
-				result = agg_retrieve_direct(node);
-				break;
-		}
-
-		if (!TupIsNull(result))
-			return result;
+		agg_puttup_hash_table(node, slot);
+		return true;
 	}
 
-	return NULL;
+	/* NULL tuple arrived, finalize aggregation and push tuples */
+	node->table_filled = true;
+	agg_push_hash_table(node);
+	/* return value no longer matters; input is exhausted */
+	return false;
 }
 
 /*
@@ -2247,141 +2348,45 @@ agg_retrieve_direct(AggState *aggstate)
 }
 
 /*
- * ExecAgg for hashed case: phase 1, read input and build hash table
- * pushTupleToAgg for hashed case: add one tuple to the hash table
  */
 static void
-agg_fill_hash_table(AggState *aggstate)
+agg_puttup_hash_table(AggState *aggstate, TupleTableSlot *outerslot)
 {
 	ExprContext *tmpcontext;
 	TupleHashEntryData *entry;
-	TupleTableSlot *outerslot;
 
 	/*
-	 * get state info from node
-	 *
 	 * tmpcontext is the per-input-tuple expression context
 	 */
 	tmpcontext = aggstate->tmpcontext;
 
-	/*
-	 * Process each outer-plan tuple, and then fetch the next one, until we
-	 * exhaust the outer plan.
-	 */
-	for (;;)
-	{
-		outerslot = fetch_input_tuple(aggstate);
-		if (TupIsNull(outerslot))
-			break;
-		/* set up for advance_aggregates call */
-		tmpcontext->ecxt_outertuple = outerslot;
 
-		/* Find or build hashtable entry for this tuple's group */
-		entry = lookup_hash_entry(aggstate, outerslot);
+	/* set up for advance_aggregates call */
+	tmpcontext->ecxt_outertuple = outerslot;
 
-		/* Advance the aggregates */
-		if (DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
-			combine_aggregates(aggstate, (AggStatePerGroup) entry->additional);
-		else
-			advance_aggregates(aggstate, (AggStatePerGroup) entry->additional);
+	/* Find or build hashtable entry for this tuple's group */
+	entry = lookup_hash_entry(aggstate, outerslot);
 
-		/* Reset per-input-tuple context after each tuple */
-		ResetExprContext(tmpcontext);
-	}
+	/* Advance the aggregates */
+	advance_aggregates(aggstate, (AggStatePerGroup) entry->additional);
 
-	aggstate->table_filled = true;
-	/* Initialize to walk the hash table */
-	ResetTupleHashIterator(aggstate->hashtable, &aggstate->hashiter);
+	/* Reset per-input-tuple context after each tuple */
+	ResetExprContext(tmpcontext);
 }
 
 /*
- * ExecAgg for hashed case: phase 2, retrieving groups from hash table
+ * Hashed case: all tuples have arrived, now push them
  */
-static TupleTableSlot *
-agg_retrieve_hash_table(AggState *aggstate)
+static void
+agg_push_hash_table(AggState *aggstate)
 {
-	ExprContext *econtext;
-	AggStatePerAgg peragg;
-	AggStatePerGroup pergroup;
-	TupleHashEntryData *entry;
-	TupleTableSlot *firstSlot;
-	TupleTableSlot *result;
-	TupleTableSlot *hashslot;
-
-	/*
-	 * get state info from node
-	 */
-	/* econtext is the per-output-tuple expression context */
-	econtext = aggstate->ss.ps.ps_ExprContext;
-	peragg = aggstate->peragg;
-	firstSlot = aggstate->ss.ss_ScanTupleSlot;
-	hashslot = aggstate->hashslot;
-
-
-	/*
-	 * We loop retrieving groups until we find one satisfying
-	 * aggstate->ss.ps.qual
-	 */
-	while (!aggstate->agg_done)
-	{
-		int i;
-
-		/*
-		 * Find the next entry in the hash table
-		 */
-		entry = ScanTupleHashTable(aggstate->hashtable, &aggstate->hashiter);
-		if (entry == NULL)
-		{
-			/* No more entries in hashtable, so done */
-			aggstate->agg_done = TRUE;
-			return NULL;
-		}
-
-		/*
-		 * Clear the per-output-tuple context for each group
-		 *
-		 * We intentionally don't use ReScanExprContext here; if any aggs have
-		 * registered shutdown callbacks, they mustn't be called yet, since we
-		 * might not be done with that agg.
-		 */
-		ResetExprContext(econtext);
-
-		/*
-		 * Transform representative tuple back into one with the right
-		 * columns.
-		 */
-		ExecStoreMinimalTuple(entry->firstTuple, hashslot, false);
-		slot_getallattrs(hashslot);
-
-		ExecClearTuple(firstSlot);
-		memset(firstSlot->tts_isnull, true,
-			   firstSlot->tts_tupleDescriptor->natts * sizeof(bool));
-
-		for (i = 0; i < aggstate->numhashGrpCols; i++)
-		{
-			int			varNumber = aggstate->hashGrpColIdxInput[i] - 1;
-
-			firstSlot->tts_values[varNumber] = hashslot->tts_values[i];
-			firstSlot->tts_isnull[varNumber] = hashslot->tts_isnull[i];
-		}
-		ExecStoreVirtualTuple(firstSlot);
-
-		pergroup = (AggStatePerGroup) entry->additional;
-
-		finalize_aggregates(aggstate, peragg, pergroup, 0);
-
-		/*
-		 * Use the representative input tuple for any references to
-		 * non-aggregated input columns in the qual and tlist.
-		 */
-		econtext->ecxt_outertuple = firstSlot;
-
-		result = project_aggregates(aggstate);
-		if (result)
-			return result;
-	}
-
-	/* No more groups */
-	return NULL;
+	/* For each tuple in hashtable, push it */
+	if (aggtuplehash_foreach((aggtuplehash_hash *) aggstate->hashtable->hashtab,
+							 aggstate))
+		/* If the parent is still waiting for tuples, let it know we are done */
+		pushTuple(NULL, aggstate->ss.ps.parent, (PlanState *) aggstate);
+	aggstate->agg_done = true;
 }
 
 /* -----------------
@@ -2392,7 +2397,7 @@ agg_retrieve_hash_table(AggState *aggstate)
  * -----------------
  */
 AggState *
-ExecInitAgg(Agg *node, EState *estate, int eflags)
+ExecInitAgg(Agg *node, EState *estate, int eflags, PlanState *parent)
 {
 	AggState   *aggstate;
 	AggStatePerAgg peraggs;
@@ -2421,6 +2426,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	aggstate = makeNode(AggState);
 	aggstate->ss.ps.plan = (Plan *) node;
 	aggstate->ss.ps.state = estate;
+	aggstate->ss.ps.parent = parent;
 
 	aggstate->aggs = NIL;
 	aggstate->numaggs = 0;
@@ -2523,7 +2529,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	if (node->aggstrategy == AGG_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
 	outerPlan = outerPlan(node);
-	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
+	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags,
+											(PlanState *) aggstate);
 
 	/*
 	 * initialize source tuple type.
@@ -3780,3 +3787,309 @@ aggregate_dummy(PG_FUNCTION_ARGS)
 		 fcinfo->flinfo->fn_oid);
 	return (Datum) 0;			/* keep compiler quiet */
 }
+
+/*
+ * We want to use our own hash table instead of the one defined in
+ * execGrouping.c, because
+ * - we want to inline its interface functions
+ * - we want a 'foreach' method with an inlined action
+ *
+ * While we need a new hash table, the stored type (TupleHashEntry) is
+ * exactly the same.  Because of that, the types (tuplehash_hash *) and
+ * (aggtuplehash_hash *) are fully compatible.  So instead of changing the
+ * type of aggstate->hashtable to a copy-pasted TupleHashTableData whose
+ * only difference is a hashtab field of type aggtuplehash_hash *, we use
+ * casts where needed.
+ *
+ * Since the functions in execGrouping.c are hard-wired to the `tuplehash`
+ * hash table defined there, we can't use them and need our own versions
+ * too; they are basically copy-pasted with the hash table name changed.
+ * Of course, this is not ideal, but our goal for now is only to estimate
+ * the performance benefits.  Later, if needed, execGrouping.c may be
+ * generalized to handle any hash table.
+ */
+
+/*
+ * Define parameters for tuple hash table code generation.
+ */
+#define SH_PREFIX aggtuplehash
+#define SH_ELEMENT_TYPE TupleHashEntryData
+#define SH_KEY_TYPE MinimalTuple
+#define SH_KEY firstTuple
+#define SH_HASH_KEY(tb, key) AggTupleHashTableHash(tb, key)
+#define SH_EQUAL(tb, a, b) AggTupleHashTableMatch(tb, a, b) == 0
+#define SH_SCOPE static inline
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_FOREACH_ON
+#define SH_FOREACH_ACC_TYPE bool
+#define SH_FOREACH_ACC_INIT true
+#define SH_FOREACH_FUNC AggPushHashEntry
+#define SH_FOREACH_ACC_FUNC inline_and
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static inline bool inline_and(bool old, bool new)
+{
+	return old && new;
+}
+
+/*
+ * Functions adapted from execGrouping.c
+ */
+
+/*
+ * Construct an empty TupleHashTable
+ *
+ *	numCols, keyColIdx: identify the tuple fields to use as lookup key
+ *	eqfunctions: equality comparison functions to use
+ *	hashfunctions: datatype-specific hashing functions to use
+ *	nbuckets: initial estimate of hashtable size
+ *	additionalsize: size of data stored in ->additional
+ *	tablecxt: memory context in which to store table and table entries
+ *	tempcxt: short-lived context for evaluating hash and comparison functions
+ *
+ * The function arrays may be made with execTuplesHashPrepare().  Note they
+ * are not cross-type functions, but expect to see the table datatype(s)
+ * on both sides.
+ *
+ * Note that keyColIdx, eqfunctions, and hashfunctions must be allocated in
+ * storage that will live as long as the hashtable does.
+ */
+static TupleHashTable
+BuildAggTupleHashTable(int numCols, AttrNumber *keyColIdx,
+					FmgrInfo *eqfunctions,
+					FmgrInfo *hashfunctions,
+					long nbuckets, Size additionalsize,
+					MemoryContext tablecxt, MemoryContext tempcxt,
+					bool use_variable_hash_iv)
+{
+	TupleHashTable hashtable;
+	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
+
+	Assert(nbuckets > 0);
+
+	/* Limit initial table size request to not more than work_mem */
+	nbuckets = Min(nbuckets, (long) ((work_mem * 1024L) / entrysize));
+
+	hashtable = (TupleHashTable)
+		MemoryContextAlloc(tablecxt, sizeof(TupleHashTableData));
+
+	hashtable->numCols = numCols;
+	hashtable->keyColIdx = keyColIdx;
+	hashtable->tab_hash_funcs = hashfunctions;
+	hashtable->tab_eq_funcs = eqfunctions;
+	hashtable->tablecxt = tablecxt;
+	hashtable->tempcxt = tempcxt;
+	hashtable->entrysize = entrysize;
+	hashtable->tableslot = NULL;	/* will be made on first lookup */
+	hashtable->inputslot = NULL;
+	hashtable->in_hash_funcs = NULL;
+	hashtable->cur_eq_funcs = NULL;
+
+	/*
+	 * If parallelism is in use, even if the master backend is performing the
+	 * scan itself, we don't want to create the hashtable exactly the same way
+	 * in all workers. As hashtables are iterated over in keyspace-order,
+	 * doing so in all processes in the same way is likely to lead to
+	 * "unbalanced" hashtables when the table size initially is
+	 * underestimated.
+	 */
+	if (use_variable_hash_iv)
+		hashtable->hash_iv = hash_uint32(ParallelWorkerNumber);
+	else
+		hashtable->hash_iv = 0;
+
+	hashtable->hashtab = (tuplehash_hash*) aggtuplehash_create(tablecxt,
+															   nbuckets,
+															   hashtable);
+
+	return hashtable;
+}
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the
+ * given tuple.  The tuple must be the same type as the hashtable entries.
+ *
+ * If isnew is NULL, we do not create new entries; we return NULL if no
+ * match is found.
+ *
+ * If isnew isn't NULL, then a new entry is created if no existing entry
+ * matches.  On return, *isnew is true if the entry is newly created,
+ * false if it existed already.  ->additional_data in the new entry has
+ * been zeroed.
+ */
+static inline TupleHashEntry
+LookupAggTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
+					 bool *isnew)
+{
+	TupleHashEntryData *entry;
+	MemoryContext oldContext;
+	bool		found;
+	MinimalTuple key;
+
+	/* If first time through, clone the input slot to make table slot */
+	if (hashtable->tableslot == NULL)
+	{
+		TupleDesc	tupdesc;
+
+		oldContext = MemoryContextSwitchTo(hashtable->tablecxt);
+
+		/*
+		 * We copy the input tuple descriptor just for safety --- we assume
+		 * all input tuples will have equivalent descriptors.
+		 */
+		tupdesc = CreateTupleDescCopy(slot->tts_tupleDescriptor);
+		hashtable->tableslot = MakeSingleTupleTableSlot(tupdesc);
+		MemoryContextSwitchTo(oldContext);
+	}
+
+	/* Need to run the hash functions in short-lived context */
+	oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+	/* set up data needed by hash and match functions */
+	hashtable->inputslot = slot;
+	hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+	hashtable->cur_eq_funcs = hashtable->tab_eq_funcs;
+
+	key = NULL; /* flag to reference inputslot */
+
+	if (isnew)
+	{
+		entry = aggtuplehash_insert((aggtuplehash_hash *) hashtable->hashtab,
+									key, &found);
+
+		if (found)
+		{
+			/* found pre-existing entry */
+			*isnew = false;
+		}
+		else
+		{
+			/* created new entry */
+			*isnew = true;
+			/* zero caller data */
+			entry->additional = NULL;
+			MemoryContextSwitchTo(hashtable->tablecxt);
+			/* Copy the first tuple into the table context */
+			entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+		}
+	}
+	else
+	{
+		entry = aggtuplehash_lookup((aggtuplehash_hash *) hashtable->hashtab,
+									key);
+	}
+
+	MemoryContextSwitchTo(oldContext);
+
+	return entry;
+}
+
+/*
+ * Compute the hash value for a tuple
+ *
+ * The passed-in key is a pointer to TupleHashEntryData.  In an actual hash
+ * table entry, the firstTuple field points to a tuple (in MinimalTuple
+ * format).  LookupTupleHashEntry sets up a dummy TupleHashEntryData with a
+ * NULL firstTuple field --- that cues us to look at the inputslot instead.
+ * This convention avoids the need to materialize virtual input tuples unless
+ * they actually need to get copied into the table.
+ *
+ * Also, the caller must select an appropriate memory context for running
+ * the hash functions.
+ */
+static inline uint32
+AggTupleHashTableHash(struct aggtuplehash_hash *tb, const MinimalTuple tuple)
+{
+	TupleHashTable hashtable = (TupleHashTable) tb->private_data;
+	int			numCols = hashtable->numCols;
+	AttrNumber *keyColIdx = hashtable->keyColIdx;
+	uint32		hashkey = hashtable->hash_iv;
+	TupleTableSlot *slot;
+	FmgrInfo   *hashfunctions;
+	int			i;
+
+	if (tuple == NULL)
+	{
+		/* Process the current input tuple for the table */
+		slot = hashtable->inputslot;
+		hashfunctions = hashtable->in_hash_funcs;
+	}
+	else
+	{
+		/*
+		 * Process a tuple already stored in the table.
+		 *
+		 * (this case never actually occurs due to the way simplehash.h is
+		 * used, as the hash-value is stored in the entries)
+		 */
+		slot = hashtable->tableslot;
+		ExecStoreMinimalTuple(tuple, slot, false);
+		hashfunctions = hashtable->tab_hash_funcs;
+	}
+
+	for (i = 0; i < numCols; i++)
+	{
+		AttrNumber	att = keyColIdx[i];
+		Datum		attr;
+		bool		isNull;
+
+		/* rotate hashkey left 1 bit at each step */
+		hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+
+		attr = slot_getattr(slot, att, &isNull);
+
+		if (!isNull)			/* treat nulls as having hash key 0 */
+		{
+			uint32		hkey;
+
+			hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i],
+												attr));
+			hashkey ^= hkey;
+		}
+	}
+
+	return hashkey;
+}
+
+/*
+ * See whether two tuples (presumably of the same hash value) match
+ *
+ * As above, the passed pointers are pointers to TupleHashEntryData.
+ *
+ * Also, the caller must select an appropriate memory context for running
+ * the compare functions.
+ */
+static inline int
+AggTupleHashTableMatch(struct aggtuplehash_hash *tb,
+					   const MinimalTuple tuple1,
+					   const MinimalTuple tuple2)
+{
+	TupleTableSlot *slot1;
+	TupleTableSlot *slot2;
+	TupleHashTable hashtable = (TupleHashTable) tb->private_data;
+
+	/*
+	 * We assume that simplehash.h will only ever call us with the first
+	 * argument being an actual table entry, and the second argument being
+	 * LookupTupleHashEntry's dummy TupleHashEntryData.  The other direction
+	 * could be supported too, but is not currently required.
+	 */
+	Assert(tuple1 != NULL);
+	slot1 = hashtable->tableslot;
+	ExecStoreMinimalTuple(tuple1, slot1, false);
+	Assert(tuple2 == NULL);
+	slot2 = hashtable->inputslot;
+
+	/* For crosstype comparisons, the inputslot must be first */
+	if (execTuplesMatch(slot2,
+						slot1,
+						hashtable->numCols,
+						hashtable->keyColIdx,
+						hashtable->cur_eq_funcs,
+						hashtable->tempcxt))
+		return 0;
+	else
+		return 1;
+}
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index eb4e27ce21..9b64c192ae 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -17,6 +17,7 @@
 #include "catalog/partition.h"
 #include "executor/execdesc.h"
 #include "nodes/parsenodes.h"
+#include "utils/memutils.h"
 
 
 /*
@@ -120,12 +121,6 @@ extern bool execCurrentOf(CurrentOfExpr *cexpr,
 /*
  * prototypes from functions in execGrouping.c
  */
-extern bool execTuplesMatch(TupleTableSlot *slot1,
-				TupleTableSlot *slot2,
-				int numCols,
-				AttrNumber *matchColIdx,
-				FmgrInfo *eqfunctions,
-				MemoryContext evalContext);
 extern bool execTuplesUnequal(TupleTableSlot *slot1,
 				  TupleTableSlot *slot2,
 				  int numCols,
@@ -414,4 +409,95 @@ extern void ExecSimpleRelationDelete(EState *estate, EPQState *epqstate,
 extern void CheckCmdReplicaIdentity(Relation rel, CmdType cmd);
 
 
+/*
+ * Below is a static inline function moved from execGrouping.c: since we
+ * have inlined all hash table interface functions in nodeAgg.c, we inline
+ * execTuplesMatch as well.
+ * Obviously this is not a good place for it; it should be moved to
+ * something like execGrouping.h and all callers updated.
+ */
+
+static inline bool execTuplesMatch(TupleTableSlot *slot1,
+								   TupleTableSlot *slot2,
+								   int numCols,
+								   AttrNumber *matchColIdx,
+								   FmgrInfo *eqfunctions,
+								   MemoryContext evalContext);
+
+/*
+ * execTuplesMatch
+ *		Return true if two tuples match in all the indicated fields.
+ *
+ * This actually implements SQL's notion of "not distinct".  Two nulls
+ * match, a null and a not-null don't match.
+ *
+ * slot1, slot2: the tuples to compare (must have same columns!)
+ * numCols: the number of attributes to be examined
+ * matchColIdx: array of attribute column numbers
+ * eqFunctions: array of fmgr lookup info for the equality functions to use
+ * evalContext: short-term memory context for executing the functions
+ *
+ * NB: evalContext is reset each time!
+ */
+static inline bool
+execTuplesMatch(TupleTableSlot *slot1,
+				TupleTableSlot *slot2,
+				int numCols,
+				AttrNumber *matchColIdx,
+				FmgrInfo *eqfunctions,
+				MemoryContext evalContext)
+{
+	MemoryContext oldContext;
+	bool		result;
+	int			i;
+
+	/* Reset and switch into the temp context. */
+	MemoryContextReset(evalContext);
+	oldContext = MemoryContextSwitchTo(evalContext);
+
+	/*
+	 * We cannot report a match without checking all the fields, but we can
+	 * report a non-match as soon as we find unequal fields.  So, start
+	 * comparing at the last field (least significant sort key). That's the
+	 * most likely to be different if we are dealing with sorted input.
+	 */
+	result = true;
+
+	for (i = numCols; --i >= 0;)
+	{
+		AttrNumber	att = matchColIdx[i];
+		Datum		attr1,
+					attr2;
+		bool		isNull1,
+					isNull2;
+
+		attr1 = slot_getattr(slot1, att, &isNull1);
+
+		attr2 = slot_getattr(slot2, att, &isNull2);
+
+		if (isNull1 != isNull2)
+		{
+			result = false;		/* one null and one not; they aren't equal */
+			break;
+		}
+
+		if (isNull1)
+			continue;			/* both are null, treat as equal */
+
+		/* Apply the type-specific equality function */
+
+		if (!DatumGetBool(FunctionCall2(&eqfunctions[i],
+										attr1, attr2)))
+		{
+			result = false;		/* they aren't equal */
+			break;
+		}
+	}
+
+	MemoryContextSwitchTo(oldContext);
+
+	return result;
+}
+
+
 #endif   /* EXECUTOR_H  */
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index d2fee52e12..b8c84bc5d7 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -16,8 +16,9 @@
 
 #include "nodes/execnodes.h"
 
-extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAgg(AggState *node);
+extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags,
+							 PlanState *parent);
+extern bool pushTupleToAgg(TupleTableSlot *slot, AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 6c6c3ee0d0..e865b87298 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -25,12 +25,33 @@
  *		declarations reside
  *    - SH_USE_NONDEFAULT_ALLOCATOR - if defined no element allocator functions
  *      are defined, so you can supply your own
+ *    - SH_FOREACH_ON - if defined, an SH_TYPE_foreach function for iterating
+ *		over the hashtable is generated. This function works as follows:
+ *		SH_FOREACH_ACC_TYPE SH_TYPE_foreach(hashtable, void *direct_arg)
+ *		{
+ *		  accum = accum_init_val
+ *		  for each element in hashtable
+ *		    accum = accum_func(accum, foreach_func(element, direct_arg))
+ *		  return accum
+ *		}
+ *	  If you use this, you have to specify the following macros:
+ *	  - SH_FOREACH_ACC_TYPE - type of accum
+ *    - some more if SH_DEFINE is defined (see below)
+ *
  *	  The following parameters are only relevant when SH_DEFINE is defined:
  *	  - SH_KEY - name of the element in SH_ELEMENT_TYPE containing the hash key
  *	  - SH_EQUAL(table, a, b) - compare two table keys
  *	  - SH_HASH_KEY(table, key) - generate hash for the key
  *	  - SH_STORE_HASH - if defined the hash is stored in the elements
  *	  - SH_GET_HASH(tb, a) - return the field to store the hash in
+ *    Macros for foreach:
+ *	  - SH_FOREACH_ACC_INIT - initial value of accum
+ *	  - SH_FOREACH_FUNC - name of foreach_func, whose prototype is
+ *	    SH_FOREACH_ACC_TYPE foreach_func(SH_ELEMENT_TYPE *el,
+ *										 void *direct_arg)
+ *    - SH_FOREACH_ACC_FUNC - name of accum_func, whose prototype is
+ *      SH_FOREACH_ACC_TYPE accum_func(SH_FOREACH_ACC_TYPE old,
+ *									   SH_FOREACH_ACC_TYPE new)
  *
  *	  For examples of usage look at simplehash.c (file local definition) and
  *	  execnodes.h/execGrouping.c (exposed declaration, file local
@@ -75,6 +96,7 @@
 #define SH_INSERT SH_MAKE_NAME(insert)
 #define SH_DELETE SH_MAKE_NAME(delete)
 #define SH_LOOKUP SH_MAKE_NAME(lookup)
+#define SH_FOREACH SH_MAKE_NAME(foreach)
 #define SH_GROW SH_MAKE_NAME(grow)
 #define SH_START_ITERATE SH_MAKE_NAME(start_iterate)
 #define SH_START_ITERATE_AT SH_MAKE_NAME(start_iterate_at)
@@ -147,6 +169,9 @@ SH_SCOPE bool SH_DELETE(SH_TYPE *tb, SH_KEY_TYPE key);
 SH_SCOPE void SH_START_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter);
 SH_SCOPE void SH_START_ITERATE_AT(SH_TYPE *tb, SH_ITERATOR *iter, uint32 at);
 SH_SCOPE SH_ELEMENT_TYPE *SH_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter);
+#ifdef SH_FOREACH_ON
+SH_SCOPE SH_FOREACH_ACC_TYPE SH_FOREACH(SH_TYPE *tb, void *direct_arg);
+#endif	 /* SH_FOREACH_ON */
 SH_SCOPE void SH_STAT(SH_TYPE *tb);
 
 #endif   /* SH_DECLARE */
@@ -827,6 +852,35 @@ SH_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter)
 }
 
 /*
+ * Iterate over the hashtable, doing something with each value and accumulating
+ * the result.
+ */
+#ifdef SH_FOREACH_ON
+SH_SCOPE SH_FOREACH_ACC_TYPE SH_FOREACH(SH_TYPE *tb, void *direct_arg)
+{
+	uint32 cur = 0;
+	SH_FOREACH_ACC_TYPE accum = SH_FOREACH_ACC_INIT;
+	SH_FOREACH_ACC_TYPE new_accum;
+	SH_ELEMENT_TYPE *elem;
+
+	do
+	{
+		elem = &tb->data[cur];
+		if (elem->status == SH_STATUS_IN_USE)
+		{
+			new_accum = SH_FOREACH_FUNC(elem, direct_arg);
+			accum = SH_FOREACH_ACC_FUNC(accum, new_accum);
+		}
+		/* next element in forward direction */
+		cur = (cur + 1) & tb->sizemask;
+	} while (cur != 0);
+
+	return accum;
+}
+#endif	 /* SH_FOREACH_ON */
+
+
+/*
  * Report some statistics about the state of the hashtable. For
  * debugging/profiling purposes only.
  */
@@ -914,6 +968,11 @@ SH_STAT(SH_TYPE *tb)
 #undef SH_GET_HASH
 #undef SH_STORE_HASH
 #undef SH_USE_NONDEFAULT_ALLOCATOR
+#undef SH_FOREACH_ON
+#undef SH_FOREACH_ACC_TYPE
+#undef SH_FOREACH_ACC_INIT
+#undef SH_FOREACH_FUNC
+#undef SH_FOREACH_ACC_FUNC
 
 /* undefine locally declared macros */
 #undef SH_MAKE_PREFIX
@@ -942,6 +1001,7 @@ SH_STAT(SH_TYPE *tb)
 #undef SH_START_ITERATE
 #undef SH_START_ITERATE_AT
 #undef SH_ITERATE
+#undef SH_FOREACH
 #undef SH_ALLOCATE
 #undef SH_FREE
 #undef SH_STAT
-- 
2.11.0

0008-Reversed-in-memory-Sort-implementation.patch (text/x-diff)
From 54f81da3169d488afee133053897a52f65481809 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Tue, 14 Mar 2017 20:03:41 +0300
Subject: [PATCH 8/8] Reversed in-memory Sort implementation.

Only in-memory sort is supported for now.
---
 src/backend/executor/execProcnode.c |  12 +++
 src/backend/executor/nodeSort.c     | 143 ++++++++++++------------------------
 src/backend/utils/sort/tuplesort.c  |  35 +++++++++
 src/include/executor/nodeSort.h     |   5 +-
 src/include/utils/tuplesort.h       |   3 +
 5 files changed, 98 insertions(+), 100 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 1aca5f0d75..a7e29a126b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -168,6 +168,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 		/*
 		 * materialization nodes
 		 */
+		case T_Sort:
+			result = (PlanState *) ExecInitSort((Sort *) node,
+												estate, eflags, parent);
+			break;
+
 		case T_Agg:
 			result = (PlanState *) ExecInitAgg((Agg *) node,
 											   estate, eflags, parent);
@@ -261,6 +266,9 @@ pushTuple(TupleTableSlot *slot, PlanState *node, PlanState *pusher)
 	if (nodeTag(node) == T_LimitState)
 		return pushTupleToLimit(slot, (LimitState *) node);
 
+	else if (nodeTag(node) == T_SortState)
+		return pushTupleToSort(slot, (SortState *) node);
+
 	else if (nodeTag(node) == T_AggState)
 		return pushTupleToAgg(slot, (AggState *) node);
 
@@ -332,6 +340,10 @@ ExecEndNode(PlanState *node)
 		/*
 		 * materialization nodes
 		 */
+		case T_SortState:
+			ExecEndSort((SortState *) node);
+			break;
+
 		case T_AggState:
 			ExecEndAgg((AggState *) node);
 			break;
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 0028912509..8ba501e9ac 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -22,11 +22,11 @@
 
 
 /* ----------------------------------------------------------------
- *		ExecSort
+ *		pushTupleToSort
  *
  *		Sorts tuples from the outer subtree of the node using tuplesort,
- *		which saves the results in a temporary file or memory. After the
- *		initial call, returns a tuple from the file with each call.
+ *		which saves the results in a temporary file or memory. On receiving
+ *		a NULL slot, it sorts and pushes all accumulated tuples.
  *
  *		Conditions:
  *		  -- none.
@@ -35,110 +35,55 @@
  *		  -- the outer child is prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
-ExecSort(SortState *node)
+bool
+pushTupleToSort(TupleTableSlot *slot, SortState *node)
 {
-	EState	   *estate;
-	ScanDirection dir;
-	Tuplesortstate *tuplesortstate;
-	TupleTableSlot *slot;
+	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
 
-	/*
-	 * get state info from node
-	 */
-	SO1_printf("ExecSort: %s\n",
-			   "entering routine");
-
-	estate = node->ss.ps.state;
-	dir = estate->es_direction;
-	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
-
-	/*
-	 * If first time through, read all tuples from outer plan and pass them to
-	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
-	 */
+	/* bounded nodes not supported yet */
+	Assert(!node->bounded);
+	/* only forward direction is supported for now */
+	Assert(ScanDirectionIsForward(node->ss.ps.state->es_direction));
+	Assert(!node->sort_Done);
 
-	if (!node->sort_Done)
+	if (!TupIsNull(slot))
 	{
-		Sort	   *plannode = (Sort *) node->ss.ps.plan;
-		PlanState  *outerNode;
-		TupleDesc	tupDesc;
-
-		SO1_printf("ExecSort: %s\n",
-				   "sorting subplan");
-
-		/*
-		 * Want to scan subplan in the forward direction while creating the
-		 * sorted data.
-		 */
-		estate->es_direction = ForwardScanDirection;
-
-		/*
-		 * Initialize tuplesort module.
-		 */
-		SO1_printf("ExecSort: %s\n",
-				   "calling tuplesort_begin");
-
-		outerNode = outerPlanState(node);
-		tupDesc = ExecGetResultType(outerNode);
-
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-											  plannode->numCols,
-											  plannode->sortColIdx,
-											  plannode->sortOperators,
-											  plannode->collations,
-											  plannode->nullsFirst,
-											  work_mem,
-											  node->randomAccess);
-		if (node->bounded)
-			tuplesort_set_bound(tuplesortstate, node->bound);
-		node->tuplesortstate = (void *) tuplesortstate;
-
-		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
-		 */
-
-		for (;;)
+		if (node->tuplesortstate == NULL)
 		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-				break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
+			/* first call, time to create tuplesort */
+			outerNode = outerPlanState(node);
+			tupDesc = ExecGetResultType(outerNode);
+
+			node->tuplesortstate = tuplesort_begin_heap(tupDesc,
+														plannode->numCols,
+														plannode->sortColIdx,
+														plannode->sortOperators,
+														plannode->collations,
+														plannode->nullsFirst,
+														work_mem,
+														node->randomAccess);
 		}
-
-		/*
-		 * Complete the sort.
-		 */
-		tuplesort_performsort(tuplesortstate);
-
-		/*
-		 * restore to user specified direction
-		 */
-		estate->es_direction = dir;
-
-		/*
-		 * finally set the sorted flag to true
-		 */
-		node->sort_Done = true;
-		node->bounded_Done = node->bounded;
-		node->bound_Done = node->bound;
-		SO1_printf("ExecSort: %s\n", "sorting done");
+		/* feed the tuple to tuplesort */
+		tuplesort_puttupleslot(node->tuplesortstate, slot);
+		return true;
 	}
 
-	SO1_printf("ExecSort: %s\n",
-			   "retrieving tuple from tuplesort");
+	/* NULL tuple arrived, sort and push tuples */
 
 	/*
-	 * Get the first or next tuple from tuplesort. Returns NULL if no more
-	 * tuples.
+	 * Complete the sort.
 	 */
-	slot = node->ss.ps.ps_ResultTupleSlot;
-	(void) tuplesort_gettupleslot(tuplesortstate,
-								  ScanDirectionIsForward(dir),
-								  slot, NULL);
-	return slot;
+	tuplesort_performsort(node->tuplesortstate);
+	node->sort_Done = true;
+
+	if (tuplesort_pushtuples(node->tuplesortstate, node))
+		/* If parent still waits for tuples, let it know we are done */
+		pushTuple(NULL, node->ss.ps.parent, (PlanState *) node);
+
+	/* doesn't matter */
+	return false;
 }
 
 /* ----------------------------------------------------------------
@@ -149,7 +94,7 @@ ExecSort(SortState *node)
  * ----------------------------------------------------------------
  */
 SortState *
-ExecInitSort(Sort *node, EState *estate, int eflags)
+ExecInitSort(Sort *node, EState *estate, int eflags, PlanState *parent)
 {
 	SortState  *sortstate;
 
@@ -162,6 +107,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	sortstate = makeNode(SortState);
 	sortstate->ss.ps.plan = (Plan *) node;
 	sortstate->ss.ps.state = estate;
+	sortstate->ss.ps.parent = parent;
 
 	/*
 	 * We must have random access to the sort output to do backward scan or
@@ -199,7 +145,8 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	 */
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
-	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
+	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags,
+											 (PlanState *) sortstate);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e1e692d5f0..96ffadafc5 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2080,6 +2080,41 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 }
 
 /*
+ * Push every tuple from tuplesort. Returns last pushTuple() result.
+ */
+bool
+tuplesort_pushtuples(Tuplesortstate *state, SortState *node)
+{
+	SortTuple	stup;
+	MemoryContext oldcontext = CurrentMemoryContext;
+	TupleTableSlot *slot = node->ss.ps.ps_ResultTupleSlot;
+	bool parent_accepts_tuples = true;
+
+	/* only in mem sort is supported for now */
+	Assert(state->status == TSS_SORTEDINMEM);
+	Assert(!state->slabAllocatorUsed);
+
+	while (state->current < state->memtupcount)
+	{
+		/* Imitating context switching as it was before */
+		MemoryContextSwitchTo(state->sortcontext);
+		stup = state->memtuples[state->current++];
+		MemoryContextSwitchTo(oldcontext);
+
+		stup.tuple = heap_copy_minimal_tuple((MinimalTuple) stup.tuple);
+		ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, true);
+		parent_accepts_tuples =
+			pushTuple(slot, node->ss.ps.parent, (PlanState *) node);
+		if (!parent_accepts_tuples)
+			return false;
+	}
+
+	state->eof_reached = true;
+	ExecClearTuple(slot);
+	return parent_accepts_tuples;
+}
+
+/*
  * Fetch the next tuple in either forward or back direction.
  * If successful, put tuple in slot and return TRUE; else, clear the slot
  * and return FALSE.
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index 10d16b47b1..27a7fb02d5 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -16,8 +16,9 @@
 
 #include "nodes/execnodes.h"
 
-extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSort(SortState *node);
+extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags,
+							   PlanState *parent);
+extern bool pushTupleToSort(TupleTableSlot *slot, SortState *node);
 extern void ExecEndSort(SortState *node);
 extern void ExecSortMarkPos(SortState *node);
 extern void ExecSortRestrPos(SortState *node);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5b3f4752f4..a3770c3af8 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -92,6 +92,9 @@ extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 
 extern void tuplesort_performsort(Tuplesortstate *state);
 
+/* forward decl, since now we need to know about SortState */
+typedef struct SortState SortState;
+extern bool tuplesort_pushtuples(Tuplesortstate *state, SortState *node);
 extern bool tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
 					   TupleTableSlot *slot, Datum *abbrev);
 extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
-- 
2.11.0

#4Arseny Sher
sher-ars@ispras.ru
In reply to: Robert Haas (#2)
Re: [GSoC] Push-based query executor discussion

While I admire your fearlessness, I think the chances of you being
able to bring a project of this type to a successful conclusion are
remote. Here is what I said about this topic previously:

/messages/by-id/CA+Tgmoa=kzHJ+TwxyQ+vKu21nk3prkRjSdbhjubN7qvc8UKuGg@mail.gmail.com

Well, as I said, I don't pretend that I will support full functionality:

instead, we should decide which part of this work (if any) is
going to be done in the course of GSoC. Probably, all TPC-H queries
with and without index support is a good initial target, but this
needs to be discussed.

I think that successful completion of this project should be a clear
and justified answer to the question "Is this idea good enough to
work on merging it into master?", not the production-ready patches
themselves. Nevertheless, of course the project's success criteria must
be reasonably formalized -- e.g. implement nodes X with features Y, etc.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Oleg Bartunov
obartunov@gmail.com
In reply to: Arseny Sher (#4)
Re: [HACKERS] [GSoC] Push-based query executor discussion

On Wed, Mar 22, 2017 at 8:04 PM, Arseny Sher <sher-ars@ispras.ru> wrote:

While I admire your fearlessness, I think the chances of you being
able to bring a project of this type to a successful conclusion are
remote. Here is what I said about this topic previously:

/messages/by-id/CA+Tgmoa=kzHJ+TwxyQ+vKu21nk3prkRjSdbhjubN7qvc8UKuGg@mail.gmail.com

Well, as I said, I don't pretend that I will support full functionality:

instead, we should decide which part of this work (if any) is
going to be done in the course of GSoC. Probably, all TPC-H queries
with and without index support is a good initial target, but this
needs to be discussed.

I think that successful completion of this project should be a clear
and justified answer to the question "Is this idea good enough to
work on merging it into master?", not the production-ready patches
themselves. Nevertheless, of course the project's success criteria must
be reasonably formalized -- e.g. implement nodes X with features Y, etc.

How many GSoC slots and possible students do we have?

Should we reject this interesting project, which is based on several years
of research work by an academic group at the institute? Maybe it is better
to help him reformulate the scope of the project and let him work. I don't
know exactly whether the results of a GSoC project should be committed, but
as a research project it would certainly be useful for the community.


#6Arseny Sher
sher-ars@ispras.ru
In reply to: Oleg Bartunov (#5)
Re: [GSoC] Push-based query executor discussion

Oleg Bartunov <obartunov@gmail.com> writes:

I don't know exactly whether the results of a GSoC project should be
committed,

Technically, they are not required:
https://developers.google.com/open-source/gsoc/faq

Are mentoring organizations required to use the code produced by
students?

No. While we hope that all the code that comes out of this program will
find a happy home, we don't require organizations to use the students'
code.

--
Arseny Sher


#7Arseny Sher
sher-ars@ispras.ru
In reply to: Noname (#6)
8 attachment(s)
Re: [GSoC] Push-based query executor discussion

I have cleaned up the code a bit and added the separation I mentioned in
a previous mail -- now there are three functions instead of the old
ExecProcNode: one for starting leaf nodes, one for passing tuples, and
one for signaling that a node has finished its job. It is all
described in execProcnode.c.

I also rewrote HashJoin without using the explicit state machine. It
seems slightly cleaner to me now...

Here are updated benchmarks:
+-------+-------------+-----------+------------+
| query | reversed, s | master, s | speedup, % |
+-------+-------------+-----------+------------+
| q01   |      108.21 |    117.88 |       8.94 |
| q03   |       55.48 |    58.805 |       5.99 |
| q04   |      78.405 |     81.86 |       4.41 |
| q05   |       49.91 |     51.18 |       2.54 |
| q10   |      49.215 |     52.61 |       6.90 |
| q12   |       63.24 |    68.505 |       8.33 |
| q14   |       33.42 |     35.31 |       5.66 |
+-------+-------------+-----------+------------+

As before, 24 runs were performed, median taken, scale is 40GB,
postgresql.conf is the same.

Patches are rebased, now they apply on 4dd3abe99f50.

--
Arseny Sher

Attachments:

0001-parent-param-added-to-ExecInitNode-parent-field-adde.patchtext/x-diffDownload
From 5391e2f6cd7607b85943b1caa3795e30ac9856dd Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Fri, 10 Mar 2017 15:02:37 +0300
Subject: [PATCH 1/8] parent param added to ExecInitNode, parent field added to
 PlanState

---
 src/backend/executor/execMain.c           | 8 ++++----
 src/backend/executor/execProcnode.c       | 3 ++-
 src/backend/executor/nodeAgg.c            | 2 +-
 src/backend/executor/nodeAppend.c         | 2 +-
 src/backend/executor/nodeBitmapAnd.c      | 2 +-
 src/backend/executor/nodeBitmapHeapscan.c | 2 +-
 src/backend/executor/nodeBitmapOr.c       | 2 +-
 src/backend/executor/nodeForeignscan.c    | 2 +-
 src/backend/executor/nodeGather.c         | 2 +-
 src/backend/executor/nodeGatherMerge.c    | 2 +-
 src/backend/executor/nodeGroup.c          | 2 +-
 src/backend/executor/nodeHash.c           | 3 ++-
 src/backend/executor/nodeHashjoin.c       | 6 ++++--
 src/backend/executor/nodeLimit.c          | 2 +-
 src/backend/executor/nodeLockRows.c       | 2 +-
 src/backend/executor/nodeMaterial.c       | 2 +-
 src/backend/executor/nodeMergeAppend.c    | 2 +-
 src/backend/executor/nodeMergejoin.c      | 5 +++--
 src/backend/executor/nodeModifyTable.c    | 2 +-
 src/backend/executor/nodeNestloop.c       | 4 ++--
 src/backend/executor/nodeProjectSet.c     | 2 +-
 src/backend/executor/nodeRecursiveunion.c | 4 ++--
 src/backend/executor/nodeResult.c         | 2 +-
 src/backend/executor/nodeSetOp.c          | 2 +-
 src/backend/executor/nodeSort.c           | 2 +-
 src/backend/executor/nodeSubqueryscan.c   | 2 +-
 src/backend/executor/nodeUnique.c         | 2 +-
 src/backend/executor/nodeWindowAgg.c      | 2 +-
 src/include/executor/executor.h           | 3 ++-
 src/include/nodes/execnodes.h             | 1 +
 30 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c28cf9c8ea..a53709bba0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -984,7 +984,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 		if (bms_is_member(i, plannedstmt->rewindPlanIDs))
 			sp_eflags |= EXEC_FLAG_REWIND;
 
-		subplanstate = ExecInitNode(subplan, estate, sp_eflags);
+		subplanstate = ExecInitNode(subplan, estate, sp_eflags, NULL);
 
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
@@ -997,7 +997,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	 * tree.  This opens files, allocates storage and leaves us ready to start
 	 * processing tuples.
 	 */
-	planstate = ExecInitNode(plan, estate, eflags);
+	planstate = ExecInitNode(plan, estate, eflags, NULL);
 
 	/*
 	 * Get the tuple descriptor describing the type of tuples to return.
@@ -3040,7 +3040,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 		Plan	   *subplan = (Plan *) lfirst(l);
 		PlanState  *subplanstate;
 
-		subplanstate = ExecInitNode(subplan, estate, 0);
+		subplanstate = ExecInitNode(subplan, estate, 0, NULL);
 		estate->es_subplanstates = lappend(estate->es_subplanstates,
 										   subplanstate);
 	}
@@ -3050,7 +3050,7 @@ EvalPlanQualStart(EPQState *epqstate, EState *parentestate, Plan *planTree)
 	 * of the plan tree we need to run.  This opens files, allocates storage
 	 * and leaves us ready to start processing tuples.
 	 */
-	epqstate->planstate = ExecInitNode(planTree, estate, 0);
+	epqstate->planstate = ExecInitNode(planTree, estate, 0, NULL);
 
 	MemoryContextSwitchTo(oldcontext);
 }
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 80c77addb8..c1c4cecd6c 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -131,12 +131,13 @@
  *		  'node' is the current node of the plan produced by the query planner
  *		  'estate' is the shared execution state for the plan tree
  *		  'eflags' is a bitwise OR of flag bits described in executor.h
+ *        'parent' is parent of the node
  *
  *		Returns a PlanState node corresponding to the given Plan node.
  * ------------------------------------------------------------------------
  */
 PlanState *
-ExecInitNode(Plan *node, EState *estate, int eflags)
+ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 {
 	PlanState  *result;
 	List	   *subps;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3207ee460c..fa19358d19 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2523,7 +2523,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	if (node->aggstrategy == AGG_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
 	outerPlan = outerPlan(node);
-	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * initialize source tuple type.
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index a107545b83..86b4fcad84 100644
--- a/src/backend/executor/nodeAppend.c
+++ b/src/backend/executor/nodeAppend.c
@@ -171,7 +171,7 @@ ExecInitAppend(Append *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		appendplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		appendplanstates[i] = ExecInitNode(initNode, estate, eflags, NULL);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index e4eb028ff9..c2a2f7d30a 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -81,7 +81,7 @@ ExecInitBitmapAnd(BitmapAnd *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags, NULL);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
index 2e9ff7d1b9..c0bcfb5d98 100644
--- a/src/backend/executor/nodeBitmapHeapscan.c
+++ b/src/backend/executor/nodeBitmapHeapscan.c
@@ -903,7 +903,7 @@ ExecInitBitmapHeapScan(BitmapHeapScan *node, EState *estate, int eflags)
 	 * relation's indexes, and we want to be sure we have acquired a lock on
 	 * the relation first.
 	 */
-	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(scanstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * all done.
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index c0f261407b..c834e29abb 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -82,7 +82,7 @@ ExecInitBitmapOr(BitmapOr *node, EState *estate, int eflags)
 	foreach(l, node->bitmapplans)
 	{
 		initNode = (Plan *) lfirst(l);
-		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		bitmapplanstates[i] = ExecInitNode(initNode, estate, eflags, NULL);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeForeignscan.c b/src/backend/executor/nodeForeignscan.c
index 3b6d1390eb..2e6ceb8b34 100644
--- a/src/backend/executor/nodeForeignscan.c
+++ b/src/backend/executor/nodeForeignscan.c
@@ -222,7 +222,7 @@ ExecInitForeignScan(ForeignScan *node, EState *estate, int eflags)
 	/* Initialize any outer plan. */
 	if (outerPlan(node))
 		outerPlanState(scanstate) =
-			ExecInitNode(outerPlan(node), estate, eflags);
+			ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * Tell the FDW to initialize the scan.
diff --git a/src/backend/executor/nodeGather.c b/src/backend/executor/nodeGather.c
index 32c97d390e..0031898acf 100644
--- a/src/backend/executor/nodeGather.c
+++ b/src/backend/executor/nodeGather.c
@@ -98,7 +98,7 @@ ExecInitGather(Gather *node, EState *estate, int eflags)
 	 * now initialize outer plan
 	 */
 	outerNode = outerPlan(node);
-	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags);
+	outerPlanState(gatherstate) = ExecInitNode(outerNode, estate, eflags, NULL);
 
 	/*
 	 * Initialize result tuple type and projection info.
diff --git a/src/backend/executor/nodeGatherMerge.c b/src/backend/executor/nodeGatherMerge.c
index 72f30ab4e6..7ed0c5bc0c 100644
--- a/src/backend/executor/nodeGatherMerge.c
+++ b/src/backend/executor/nodeGatherMerge.c
@@ -102,7 +102,7 @@ ExecInitGatherMerge(GatherMerge *node, EState *estate, int eflags)
 	 * now initialize outer plan
 	 */
 	outerNode = outerPlan(node);
-	outerPlanState(gm_state) = ExecInitNode(outerNode, estate, eflags);
+	outerPlanState(gm_state) = ExecInitNode(outerNode, estate, eflags, NULL);
 
 	/*
 	 * Initialize result tuple type and projection info.
diff --git a/src/backend/executor/nodeGroup.c b/src/backend/executor/nodeGroup.c
index 66c095bc72..5338e29187 100644
--- a/src/backend/executor/nodeGroup.c
+++ b/src/backend/executor/nodeGroup.c
@@ -198,7 +198,7 @@ ExecInitGroup(Group *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(grpstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(grpstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * initialize tuple type.
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index e695d8834b..43e65ca04e 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -200,7 +200,8 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags,
+											 (PlanState*) hashstate);
 
 	/*
 	 * initialize tuple type. no need to initialize projection info because
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index c50d93f43d..b48863f90b 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -435,8 +435,10 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	outerNode = outerPlan(node);
 	hashNode = (Hash *) innerPlan(node);
 
-	outerPlanState(hjstate) = ExecInitNode(outerNode, estate, eflags);
-	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags);
+	outerPlanState(hjstate) = ExecInitNode(outerNode, estate, eflags,
+										   (PlanState *) hjstate);
+	innerPlanState(hjstate) = ExecInitNode((Plan *) hashNode, estate, eflags,
+										   (PlanState *) hjstate);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index aaec132218..bcacbfc13b 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -403,7 +403,7 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	 * then initialize outer plan
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * limit nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeLockRows.c b/src/backend/executor/nodeLockRows.c
index b098034337..446a5e6fb3 100644
--- a/src/backend/executor/nodeLockRows.c
+++ b/src/backend/executor/nodeLockRows.c
@@ -376,7 +376,7 @@ ExecInitLockRows(LockRows *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(lrstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(lrstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * LockRows nodes do no projections, so initialize projection info for
diff --git a/src/backend/executor/nodeMaterial.c b/src/backend/executor/nodeMaterial.c
index aa5d2529f4..97d025977f 100644
--- a/src/backend/executor/nodeMaterial.c
+++ b/src/backend/executor/nodeMaterial.c
@@ -219,7 +219,7 @@ ExecInitMaterial(Material *node, EState *estate, int eflags)
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
 	outerPlan = outerPlan(node);
-	outerPlanState(matstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(matstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeMergeAppend.c b/src/backend/executor/nodeMergeAppend.c
index 8a2e78266b..a98a927b94 100644
--- a/src/backend/executor/nodeMergeAppend.c
+++ b/src/backend/executor/nodeMergeAppend.c
@@ -118,7 +118,7 @@ ExecInitMergeAppend(MergeAppend *node, EState *estate, int eflags)
 	{
 		Plan	   *initNode = (Plan *) lfirst(lc);
 
-		mergeplanstates[i] = ExecInitNode(initNode, estate, eflags);
+		mergeplanstates[i] = ExecInitNode(initNode, estate, eflags, NULL);
 		i++;
 	}
 
diff --git a/src/backend/executor/nodeMergejoin.c b/src/backend/executor/nodeMergejoin.c
index 105e2dcedb..68c53ba1fe 100644
--- a/src/backend/executor/nodeMergejoin.c
+++ b/src/backend/executor/nodeMergejoin.c
@@ -1473,9 +1473,10 @@ ExecInitMergeJoin(MergeJoin *node, EState *estate, int eflags)
 	 *
 	 * inner child must support MARK/RESTORE.
 	 */
-	outerPlanState(mergestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(mergestate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 	innerPlanState(mergestate) = ExecInitNode(innerPlan(node), estate,
-											  eflags | EXEC_FLAG_MARK);
+											  eflags | EXEC_FLAG_MARK,
+											  NULL);
 
 	/*
 	 * For certain types of inner child nodes, it is advantageous to issue
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 29c6a6e1d8..03674f23dd 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1704,7 +1704,7 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 
 		/* Now init the plan for this result rel */
 		estate->es_result_relation_info = resultRelInfo;
-		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags);
+		mtstate->mt_plans[i] = ExecInitNode(subplan, estate, eflags, NULL);
 
 		/* Also let FDWs init themselves for foreign-table result rels */
 		if (!resultRelInfo->ri_usesFdwDirectModify &&
diff --git a/src/backend/executor/nodeNestloop.c b/src/backend/executor/nodeNestloop.c
index cac7ba1b9b..697f5d48a2 100644
--- a/src/backend/executor/nodeNestloop.c
+++ b/src/backend/executor/nodeNestloop.c
@@ -302,12 +302,12 @@ ExecInitNestLoop(NestLoop *node, EState *estate, int eflags)
 	 * inner child, because it will always be rescanned with fresh parameter
 	 * values.
 	 */
-	outerPlanState(nlstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(nlstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 	if (node->nestParams == NIL)
 		eflags |= EXEC_FLAG_REWIND;
 	else
 		eflags &= ~EXEC_FLAG_REWIND;
-	innerPlanState(nlstate) = ExecInitNode(innerPlan(node), estate, eflags);
+	innerPlanState(nlstate) = ExecInitNode(innerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * tuple table initialization
diff --git a/src/backend/executor/nodeProjectSet.c b/src/backend/executor/nodeProjectSet.c
index eae0f1dad9..0c61685430 100644
--- a/src/backend/executor/nodeProjectSet.c
+++ b/src/backend/executor/nodeProjectSet.c
@@ -240,7 +240,7 @@ ExecInitProjectSet(ProjectSet *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(state) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(state) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * we don't use inner plan
diff --git a/src/backend/executor/nodeRecursiveunion.c b/src/backend/executor/nodeRecursiveunion.c
index fc1c00d68f..4b91f155c9 100644
--- a/src/backend/executor/nodeRecursiveunion.c
+++ b/src/backend/executor/nodeRecursiveunion.c
@@ -235,8 +235,8 @@ ExecInitRecursiveUnion(RecursiveUnion *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(rustate) = ExecInitNode(outerPlan(node), estate, eflags);
-	innerPlanState(rustate) = ExecInitNode(innerPlan(node), estate, eflags);
+	outerPlanState(rustate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
+	innerPlanState(rustate) = ExecInitNode(innerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * If hashing, precompute fmgr lookup data for inner loop, and create the
diff --git a/src/backend/executor/nodeResult.c b/src/backend/executor/nodeResult.c
index b5b50b21e9..bbc0c82c3f 100644
--- a/src/backend/executor/nodeResult.c
+++ b/src/backend/executor/nodeResult.c
@@ -221,7 +221,7 @@ ExecInitResult(Result *node, EState *estate, int eflags)
 	/*
 	 * initialize child nodes
 	 */
-	outerPlanState(resstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(resstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * we don't use inner plan
diff --git a/src/backend/executor/nodeSetOp.c b/src/backend/executor/nodeSetOp.c
index 85b3f67b33..f437ec1044 100644
--- a/src/backend/executor/nodeSetOp.c
+++ b/src/backend/executor/nodeSetOp.c
@@ -526,7 +526,7 @@ ExecInitSetOp(SetOp *node, EState *estate, int eflags)
 	 */
 	if (node->strategy == SETOP_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
-	outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(setopstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * setop nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 591a31aa6a..0028912509 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -199,7 +199,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	 */
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
-	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/executor/nodeSubqueryscan.c b/src/backend/executor/nodeSubqueryscan.c
index 230a96f9d2..b3cbe266dc 100644
--- a/src/backend/executor/nodeSubqueryscan.c
+++ b/src/backend/executor/nodeSubqueryscan.c
@@ -136,7 +136,7 @@ ExecInitSubqueryScan(SubqueryScan *node, EState *estate, int eflags)
 	/*
 	 * initialize subquery
 	 */
-	subquerystate->subplan = ExecInitNode(node->subplan, estate, eflags);
+	subquerystate->subplan = ExecInitNode(node->subplan, estate, eflags, NULL);
 
 	/*
 	 * Initialize scan tuple type (needed by ExecAssignScanProjectionInfo)
diff --git a/src/backend/executor/nodeUnique.c b/src/backend/executor/nodeUnique.c
index 28cc1e90f8..244c49f2dc 100644
--- a/src/backend/executor/nodeUnique.c
+++ b/src/backend/executor/nodeUnique.c
@@ -143,7 +143,7 @@ ExecInitUnique(Unique *node, EState *estate, int eflags)
 	/*
 	 * then initialize outer plan
 	 */
-	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags);
+	outerPlanState(uniquestate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
 
 	/*
 	 * unique nodes do no projections, so initialize projection info for this
diff --git a/src/backend/executor/nodeWindowAgg.c b/src/backend/executor/nodeWindowAgg.c
index 2a123e8452..39971458d1 100644
--- a/src/backend/executor/nodeWindowAgg.c
+++ b/src/backend/executor/nodeWindowAgg.c
@@ -1841,7 +1841,7 @@ ExecInitWindowAgg(WindowAgg *node, EState *estate, int eflags)
 	 * initialize child nodes
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(winstate) = ExecInitNode(outerPlan, estate, eflags);
+	outerPlanState(winstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
 
 	/*
 	 * initialize source tuple type (which is also the tuple type that we'll
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index a5c75e771f..716fa9dc27 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -235,7 +235,8 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
 /*
  * prototypes from functions in execProcnode.c
  */
-extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
+extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags,
+	PlanState *parent);
 extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f856f6036f..738f098b00 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1062,6 +1062,7 @@ typedef struct PlanState
 	 */
 	List	   *targetlist;		/* target list to be computed at this node */
 	List	   *qual;			/* implicitly-ANDed qual conditions */
+	struct PlanState *parent;   /* parent node, NULL if root */
 	struct PlanState *lefttree; /* input plan tree(s) */
 	struct PlanState *righttree;
 	List	   *initPlan;		/* Init SubPlanState nodes (un-correlated expr
-- 
2.11.0

0002-Nodes-interface-functions-stubbed.patchtext/x-diffDownload
From 05c7ca5cee31ff7ae4b8949ab42fb6e9d3791f92 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Fri, 10 Mar 2017 15:39:12 +0300
Subject: [PATCH 2/8] Nodes interface functions stubbed.

Namely, ExecProcNode, ExecInitNode, ExecEndNode, MultiExecProcNode, ExecRescan,
ExecutorRewind. It breaks the existing executor.
---
 src/backend/executor/execAmi.c      | 213 +-----------
 src/backend/executor/execMain.c     |  26 +-
 src/backend/executor/execProcnode.c | 635 +-----------------------------------
 3 files changed, 16 insertions(+), 858 deletions(-)

diff --git a/src/backend/executor/execAmi.c b/src/backend/executor/execAmi.c
index 5d59f95a91..a447cb92ba 100644
--- a/src/backend/executor/execAmi.c
+++ b/src/backend/executor/execAmi.c
@@ -73,218 +73,7 @@ static bool IndexSupportsBackwardScan(Oid indexid);
 void
 ExecReScan(PlanState *node)
 {
-	/* If collecting timing stats, update them */
-	if (node->instrument)
-		InstrEndLoop(node->instrument);
-
-	/*
-	 * If we have changed parameters, propagate that info.
-	 *
-	 * Note: ExecReScanSetParamPlan() can add bits to node->chgParam,
-	 * corresponding to the output param(s) that the InitPlan will update.
-	 * Since we make only one pass over the list, that means that an InitPlan
-	 * can depend on the output param(s) of a sibling InitPlan only if that
-	 * sibling appears earlier in the list.  This is workable for now given
-	 * the limited ways in which one InitPlan could depend on another, but
-	 * eventually we might need to work harder (or else make the planner
-	 * enlarge the extParam/allParam sets to include the params of depended-on
-	 * InitPlans).
-	 */
-	if (node->chgParam != NULL)
-	{
-		ListCell   *l;
-
-		foreach(l, node->initPlan)
-		{
-			SubPlanState *sstate = (SubPlanState *) lfirst(l);
-			PlanState  *splan = sstate->planstate;
-
-			if (splan->plan->extParam != NULL)	/* don't care about child
-												 * local Params */
-				UpdateChangedParamSet(splan, node->chgParam);
-			if (splan->chgParam != NULL)
-				ExecReScanSetParamPlan(sstate, node);
-		}
-		foreach(l, node->subPlan)
-		{
-			SubPlanState *sstate = (SubPlanState *) lfirst(l);
-			PlanState  *splan = sstate->planstate;
-
-			if (splan->plan->extParam != NULL)
-				UpdateChangedParamSet(splan, node->chgParam);
-		}
-		/* Well. Now set chgParam for left/right trees. */
-		if (node->lefttree != NULL)
-			UpdateChangedParamSet(node->lefttree, node->chgParam);
-		if (node->righttree != NULL)
-			UpdateChangedParamSet(node->righttree, node->chgParam);
-	}
-
-	/* Call expression callbacks */
-	if (node->ps_ExprContext)
-		ReScanExprContext(node->ps_ExprContext);
-
-	/* And do node-type-specific processing */
-	switch (nodeTag(node))
-	{
-		case T_ResultState:
-			ExecReScanResult((ResultState *) node);
-			break;
-
-		case T_ProjectSetState:
-			ExecReScanProjectSet((ProjectSetState *) node);
-			break;
-
-		case T_ModifyTableState:
-			ExecReScanModifyTable((ModifyTableState *) node);
-			break;
-
-		case T_AppendState:
-			ExecReScanAppend((AppendState *) node);
-			break;
-
-		case T_MergeAppendState:
-			ExecReScanMergeAppend((MergeAppendState *) node);
-			break;
-
-		case T_RecursiveUnionState:
-			ExecReScanRecursiveUnion((RecursiveUnionState *) node);
-			break;
-
-		case T_BitmapAndState:
-			ExecReScanBitmapAnd((BitmapAndState *) node);
-			break;
-
-		case T_BitmapOrState:
-			ExecReScanBitmapOr((BitmapOrState *) node);
-			break;
-
-		case T_SeqScanState:
-			ExecReScanSeqScan((SeqScanState *) node);
-			break;
-
-		case T_SampleScanState:
-			ExecReScanSampleScan((SampleScanState *) node);
-			break;
-
-		case T_GatherState:
-			ExecReScanGather((GatherState *) node);
-			break;
-
-		case T_IndexScanState:
-			ExecReScanIndexScan((IndexScanState *) node);
-			break;
-
-		case T_IndexOnlyScanState:
-			ExecReScanIndexOnlyScan((IndexOnlyScanState *) node);
-			break;
-
-		case T_BitmapIndexScanState:
-			ExecReScanBitmapIndexScan((BitmapIndexScanState *) node);
-			break;
-
-		case T_BitmapHeapScanState:
-			ExecReScanBitmapHeapScan((BitmapHeapScanState *) node);
-			break;
-
-		case T_TidScanState:
-			ExecReScanTidScan((TidScanState *) node);
-			break;
-
-		case T_SubqueryScanState:
-			ExecReScanSubqueryScan((SubqueryScanState *) node);
-			break;
-
-		case T_FunctionScanState:
-			ExecReScanFunctionScan((FunctionScanState *) node);
-			break;
-
-		case T_TableFuncScanState:
-			ExecReScanTableFuncScan((TableFuncScanState *) node);
-			break;
-
-		case T_ValuesScanState:
-			ExecReScanValuesScan((ValuesScanState *) node);
-			break;
-
-		case T_CteScanState:
-			ExecReScanCteScan((CteScanState *) node);
-			break;
-
-		case T_WorkTableScanState:
-			ExecReScanWorkTableScan((WorkTableScanState *) node);
-			break;
-
-		case T_ForeignScanState:
-			ExecReScanForeignScan((ForeignScanState *) node);
-			break;
-
-		case T_CustomScanState:
-			ExecReScanCustomScan((CustomScanState *) node);
-			break;
-
-		case T_NestLoopState:
-			ExecReScanNestLoop((NestLoopState *) node);
-			break;
-
-		case T_MergeJoinState:
-			ExecReScanMergeJoin((MergeJoinState *) node);
-			break;
-
-		case T_HashJoinState:
-			ExecReScanHashJoin((HashJoinState *) node);
-			break;
-
-		case T_MaterialState:
-			ExecReScanMaterial((MaterialState *) node);
-			break;
-
-		case T_SortState:
-			ExecReScanSort((SortState *) node);
-			break;
-
-		case T_GroupState:
-			ExecReScanGroup((GroupState *) node);
-			break;
-
-		case T_AggState:
-			ExecReScanAgg((AggState *) node);
-			break;
-
-		case T_WindowAggState:
-			ExecReScanWindowAgg((WindowAggState *) node);
-			break;
-
-		case T_UniqueState:
-			ExecReScanUnique((UniqueState *) node);
-			break;
-
-		case T_HashState:
-			ExecReScanHash((HashState *) node);
-			break;
-
-		case T_SetOpState:
-			ExecReScanSetOp((SetOpState *) node);
-			break;
-
-		case T_LockRowsState:
-			ExecReScanLockRows((LockRowsState *) node);
-			break;
-
-		case T_LimitState:
-			ExecReScanLimit((LimitState *) node);
-			break;
-
-		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			break;
-	}
-
-	if (node->chgParam != NULL)
-	{
-		bms_free(node->chgParam);
-		node->chgParam = NULL;
-	}
+	elog(ERROR, "ExecReScan not implemented yet");
 }
 
 /*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index a53709bba0..a465d74eab 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -518,30 +518,8 @@ standard_ExecutorEnd(QueryDesc *queryDesc)
 void
 ExecutorRewind(QueryDesc *queryDesc)
 {
-	EState	   *estate;
-	MemoryContext oldcontext;
-
-	/* sanity checks */
-	Assert(queryDesc != NULL);
-
-	estate = queryDesc->estate;
-
-	Assert(estate != NULL);
-
-	/* It's probably not sensible to rescan updating queries */
-	Assert(queryDesc->operation == CMD_SELECT);
-
-	/*
-	 * Switch into per-query memory context
-	 */
-	oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
-
-	/*
-	 * rescan plan
-	 */
-	ExecReScan(queryDesc->planstate);
-
-	MemoryContextSwitchTo(oldcontext);
+	elog(ERROR, "Rewinding not supported");
+	return;
 }
 
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c1c4cecd6c..b013a17023 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -131,7 +131,7 @@
  *		  'node' is the current node of the plan produced by the query planner
  *		  'estate' is the shared execution state for the plan tree
  *		  'eflags' is a bitwise OR of flag bits described in executor.h
- *        'parent' is parent of the node
+ *		  'parent' is parent of the node
  *
  *		Returns a PlanState node corresponding to the given Plan node.
  * ------------------------------------------------------------------------
@@ -140,8 +140,6 @@ PlanState *
 ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 {
 	PlanState  *result;
-	List	   *subps;
-	ListCell   *l;
 
 	/*
 	 * do nothing when we get to the end of a leaf on tree.
@@ -151,229 +149,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 
 	switch (nodeTag(node))
 	{
-			/*
-			 * control nodes
-			 */
-		case T_Result:
-			result = (PlanState *) ExecInitResult((Result *) node,
-												  estate, eflags);
-			break;
-
-		case T_ProjectSet:
-			result = (PlanState *) ExecInitProjectSet((ProjectSet *) node,
-													  estate, eflags);
-			break;
-
-		case T_ModifyTable:
-			result = (PlanState *) ExecInitModifyTable((ModifyTable *) node,
-													   estate, eflags);
-			break;
-
-		case T_Append:
-			result = (PlanState *) ExecInitAppend((Append *) node,
-												  estate, eflags);
-			break;
-
-		case T_MergeAppend:
-			result = (PlanState *) ExecInitMergeAppend((MergeAppend *) node,
-													   estate, eflags);
-			break;
-
-		case T_RecursiveUnion:
-			result = (PlanState *) ExecInitRecursiveUnion((RecursiveUnion *) node,
-														  estate, eflags);
-			break;
-
-		case T_BitmapAnd:
-			result = (PlanState *) ExecInitBitmapAnd((BitmapAnd *) node,
-													 estate, eflags);
-			break;
-
-		case T_BitmapOr:
-			result = (PlanState *) ExecInitBitmapOr((BitmapOr *) node,
-													estate, eflags);
-			break;
-
-			/*
-			 * scan nodes
-			 */
-		case T_SeqScan:
-			result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
-												   estate, eflags);
-			break;
-
-		case T_SampleScan:
-			result = (PlanState *) ExecInitSampleScan((SampleScan *) node,
-													  estate, eflags);
-			break;
-
-		case T_IndexScan:
-			result = (PlanState *) ExecInitIndexScan((IndexScan *) node,
-													 estate, eflags);
-			break;
-
-		case T_IndexOnlyScan:
-			result = (PlanState *) ExecInitIndexOnlyScan((IndexOnlyScan *) node,
-														 estate, eflags);
-			break;
-
-		case T_BitmapIndexScan:
-			result = (PlanState *) ExecInitBitmapIndexScan((BitmapIndexScan *) node,
-														   estate, eflags);
-			break;
-
-		case T_BitmapHeapScan:
-			result = (PlanState *) ExecInitBitmapHeapScan((BitmapHeapScan *) node,
-														  estate, eflags);
-			break;
-
-		case T_TidScan:
-			result = (PlanState *) ExecInitTidScan((TidScan *) node,
-												   estate, eflags);
-			break;
-
-		case T_SubqueryScan:
-			result = (PlanState *) ExecInitSubqueryScan((SubqueryScan *) node,
-														estate, eflags);
-			break;
-
-		case T_FunctionScan:
-			result = (PlanState *) ExecInitFunctionScan((FunctionScan *) node,
-														estate, eflags);
-			break;
-
-		case T_TableFuncScan:
-			result = (PlanState *) ExecInitTableFuncScan((TableFuncScan *) node,
-														 estate, eflags);
-			break;
-
-		case T_ValuesScan:
-			result = (PlanState *) ExecInitValuesScan((ValuesScan *) node,
-													  estate, eflags);
-			break;
-
-		case T_CteScan:
-			result = (PlanState *) ExecInitCteScan((CteScan *) node,
-												   estate, eflags);
-			break;
-
-		case T_WorkTableScan:
-			result = (PlanState *) ExecInitWorkTableScan((WorkTableScan *) node,
-														 estate, eflags);
-			break;
-
-		case T_ForeignScan:
-			result = (PlanState *) ExecInitForeignScan((ForeignScan *) node,
-													   estate, eflags);
-			break;
-
-		case T_CustomScan:
-			result = (PlanState *) ExecInitCustomScan((CustomScan *) node,
-													  estate, eflags);
-			break;
-
-			/*
-			 * join nodes
-			 */
-		case T_NestLoop:
-			result = (PlanState *) ExecInitNestLoop((NestLoop *) node,
-													estate, eflags);
-			break;
-
-		case T_MergeJoin:
-			result = (PlanState *) ExecInitMergeJoin((MergeJoin *) node,
-													 estate, eflags);
-			break;
-
-		case T_HashJoin:
-			result = (PlanState *) ExecInitHashJoin((HashJoin *) node,
-													estate, eflags);
-			break;
-
-			/*
-			 * materialization nodes
-			 */
-		case T_Material:
-			result = (PlanState *) ExecInitMaterial((Material *) node,
-													estate, eflags);
-			break;
-
-		case T_Sort:
-			result = (PlanState *) ExecInitSort((Sort *) node,
-												estate, eflags);
-			break;
-
-		case T_Group:
-			result = (PlanState *) ExecInitGroup((Group *) node,
-												 estate, eflags);
-			break;
-
-		case T_Agg:
-			result = (PlanState *) ExecInitAgg((Agg *) node,
-											   estate, eflags);
-			break;
-
-		case T_WindowAgg:
-			result = (PlanState *) ExecInitWindowAgg((WindowAgg *) node,
-													 estate, eflags);
-			break;
-
-		case T_Unique:
-			result = (PlanState *) ExecInitUnique((Unique *) node,
-												  estate, eflags);
-			break;
-
-		case T_Gather:
-			result = (PlanState *) ExecInitGather((Gather *) node,
-												  estate, eflags);
-			break;
-
-		case T_GatherMerge:
-			result = (PlanState *) ExecInitGatherMerge((GatherMerge *) node,
-													   estate, eflags);
-			break;
-
-		case T_Hash:
-			result = (PlanState *) ExecInitHash((Hash *) node,
-												estate, eflags);
-			break;
-
-		case T_SetOp:
-			result = (PlanState *) ExecInitSetOp((SetOp *) node,
-												 estate, eflags);
-			break;
-
-		case T_LockRows:
-			result = (PlanState *) ExecInitLockRows((LockRows *) node,
-													estate, eflags);
-			break;
-
-		case T_Limit:
-			result = (PlanState *) ExecInitLimit((Limit *) node,
-												 estate, eflags);
-			break;
-
 		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			result = NULL;		/* keep compiler quiet */
-			break;
-	}
-
-	/*
-	 * Initialize any initPlans present in this node.  The planner put them in
-	 * a separate list for us.
-	 */
-	subps = NIL;
-	foreach(l, node->initPlan)
-	{
-		SubPlan    *subplan = (SubPlan *) lfirst(l);
-		SubPlanState *sstate;
-
-		Assert(IsA(subplan, SubPlan));
-		sstate = ExecInitSubPlan(subplan, result);
-		subps = lappend(subps, sstate);
+			elog(ERROR, "unrecognized/unsupported node type: %d",
+				 (int) nodeTag(node));
+			return NULL;		/* keep compiler quiet */
 	}
-	result->initPlan = subps;
 
 	/* Set up instrumentation for this node if requested */
 	if (estate->es_instrument)
@@ -383,253 +163,27 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 }
 
 
-/* ----------------------------------------------------------------
- *		ExecProcNode
- *
- *		Execute the given node to return a(nother) tuple.
- * ----------------------------------------------------------------
+/*
+ * Unsupported, left to avoid deleting 19k lines of existing code
  */
 TupleTableSlot *
 ExecProcNode(PlanState *node)
 {
-	TupleTableSlot *result;
-
-	CHECK_FOR_INTERRUPTS();
-
-	if (node->chgParam != NULL) /* something changed */
-		ExecReScan(node);		/* let ReScan handle this */
-
-	if (node->instrument)
-		InstrStartNode(node->instrument);
-
-	switch (nodeTag(node))
-	{
-			/*
-			 * control nodes
-			 */
-		case T_ResultState:
-			result = ExecResult((ResultState *) node);
-			break;
-
-		case T_ProjectSetState:
-			result = ExecProjectSet((ProjectSetState *) node);
-			break;
-
-		case T_ModifyTableState:
-			result = ExecModifyTable((ModifyTableState *) node);
-			break;
-
-		case T_AppendState:
-			result = ExecAppend((AppendState *) node);
-			break;
-
-		case T_MergeAppendState:
-			result = ExecMergeAppend((MergeAppendState *) node);
-			break;
-
-		case T_RecursiveUnionState:
-			result = ExecRecursiveUnion((RecursiveUnionState *) node);
-			break;
-
-			/* BitmapAndState does not yield tuples */
-
-			/* BitmapOrState does not yield tuples */
-
-			/*
-			 * scan nodes
-			 */
-		case T_SeqScanState:
-			result = ExecSeqScan((SeqScanState *) node);
-			break;
-
-		case T_SampleScanState:
-			result = ExecSampleScan((SampleScanState *) node);
-			break;
-
-		case T_IndexScanState:
-			result = ExecIndexScan((IndexScanState *) node);
-			break;
-
-		case T_IndexOnlyScanState:
-			result = ExecIndexOnlyScan((IndexOnlyScanState *) node);
-			break;
-
-			/* BitmapIndexScanState does not yield tuples */
-
-		case T_BitmapHeapScanState:
-			result = ExecBitmapHeapScan((BitmapHeapScanState *) node);
-			break;
-
-		case T_TidScanState:
-			result = ExecTidScan((TidScanState *) node);
-			break;
-
-		case T_SubqueryScanState:
-			result = ExecSubqueryScan((SubqueryScanState *) node);
-			break;
-
-		case T_FunctionScanState:
-			result = ExecFunctionScan((FunctionScanState *) node);
-			break;
-
-		case T_TableFuncScanState:
-			result = ExecTableFuncScan((TableFuncScanState *) node);
-			break;
-
-		case T_ValuesScanState:
-			result = ExecValuesScan((ValuesScanState *) node);
-			break;
-
-		case T_CteScanState:
-			result = ExecCteScan((CteScanState *) node);
-			break;
-
-		case T_WorkTableScanState:
-			result = ExecWorkTableScan((WorkTableScanState *) node);
-			break;
-
-		case T_ForeignScanState:
-			result = ExecForeignScan((ForeignScanState *) node);
-			break;
-
-		case T_CustomScanState:
-			result = ExecCustomScan((CustomScanState *) node);
-			break;
-
-			/*
-			 * join nodes
-			 */
-		case T_NestLoopState:
-			result = ExecNestLoop((NestLoopState *) node);
-			break;
-
-		case T_MergeJoinState:
-			result = ExecMergeJoin((MergeJoinState *) node);
-			break;
-
-		case T_HashJoinState:
-			result = ExecHashJoin((HashJoinState *) node);
-			break;
-
-			/*
-			 * materialization nodes
-			 */
-		case T_MaterialState:
-			result = ExecMaterial((MaterialState *) node);
-			break;
-
-		case T_SortState:
-			result = ExecSort((SortState *) node);
-			break;
-
-		case T_GroupState:
-			result = ExecGroup((GroupState *) node);
-			break;
-
-		case T_AggState:
-			result = ExecAgg((AggState *) node);
-			break;
-
-		case T_WindowAggState:
-			result = ExecWindowAgg((WindowAggState *) node);
-			break;
-
-		case T_UniqueState:
-			result = ExecUnique((UniqueState *) node);
-			break;
-
-		case T_GatherState:
-			result = ExecGather((GatherState *) node);
-			break;
-
-		case T_GatherMergeState:
-			result = ExecGatherMerge((GatherMergeState *) node);
-			break;
-
-		case T_HashState:
-			result = ExecHash((HashState *) node);
-			break;
-
-		case T_SetOpState:
-			result = ExecSetOp((SetOpState *) node);
-			break;
-
-		case T_LockRowsState:
-			result = ExecLockRows((LockRowsState *) node);
-			break;
-
-		case T_LimitState:
-			result = ExecLimit((LimitState *) node);
-			break;
-
-		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			result = NULL;
-			break;
-	}
-
-	if (node->instrument)
-		InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
-	return result;
+	elog(ERROR, "ExecProcNode is not supported");
+	return NULL;
 }
 
-
 /* ----------------------------------------------------------------
- *		MultiExecProcNode
- *
- *		Execute a node that doesn't return individual tuples
- *		(it might return a hashtable, bitmap, etc).  Caller should
- *		check it got back the expected kind of Node.
- *
- * This has essentially the same responsibilities as ExecProcNode,
- * but it does not do InstrStartNode/InstrStopNode (mainly because
- * it can't tell how many returned tuples to count).  Each per-node
- * function must provide its own instrumentation support.
+ * Unsupported too; we don't need it in the push model
  * ----------------------------------------------------------------
  */
 Node *
 MultiExecProcNode(PlanState *node)
 {
-	Node	   *result;
-
-	CHECK_FOR_INTERRUPTS();
-
-	if (node->chgParam != NULL) /* something changed */
-		ExecReScan(node);		/* let ReScan handle this */
-
-	switch (nodeTag(node))
-	{
-			/*
-			 * Only node types that actually support multiexec will be listed
-			 */
-
-		case T_HashState:
-			result = MultiExecHash((HashState *) node);
-			break;
-
-		case T_BitmapIndexScanState:
-			result = MultiExecBitmapIndexScan((BitmapIndexScanState *) node);
-			break;
-
-		case T_BitmapAndState:
-			result = MultiExecBitmapAnd((BitmapAndState *) node);
-			break;
-
-		case T_BitmapOrState:
-			result = MultiExecBitmapOr((BitmapOrState *) node);
-			break;
-
-		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
-			result = NULL;
-			break;
-	}
-
-	return result;
+	elog(ERROR, "MultiExecProcNode is not supported");
+	return NULL;
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecEndNode
  *
@@ -658,172 +212,9 @@ ExecEndNode(PlanState *node)
 
 	switch (nodeTag(node))
 	{
-			/*
-			 * control nodes
-			 */
-		case T_ResultState:
-			ExecEndResult((ResultState *) node);
-			break;
-
-		case T_ProjectSetState:
-			ExecEndProjectSet((ProjectSetState *) node);
-			break;
-
-		case T_ModifyTableState:
-			ExecEndModifyTable((ModifyTableState *) node);
-			break;
-
-		case T_AppendState:
-			ExecEndAppend((AppendState *) node);
-			break;
-
-		case T_MergeAppendState:
-			ExecEndMergeAppend((MergeAppendState *) node);
-			break;
-
-		case T_RecursiveUnionState:
-			ExecEndRecursiveUnion((RecursiveUnionState *) node);
-			break;
-
-		case T_BitmapAndState:
-			ExecEndBitmapAnd((BitmapAndState *) node);
-			break;
-
-		case T_BitmapOrState:
-			ExecEndBitmapOr((BitmapOrState *) node);
-			break;
-
-			/*
-			 * scan nodes
-			 */
-		case T_SeqScanState:
-			ExecEndSeqScan((SeqScanState *) node);
-			break;
-
-		case T_SampleScanState:
-			ExecEndSampleScan((SampleScanState *) node);
-			break;
-
-		case T_GatherState:
-			ExecEndGather((GatherState *) node);
-			break;
-
-		case T_GatherMergeState:
-			ExecEndGatherMerge((GatherMergeState *) node);
-			break;
-
-		case T_IndexScanState:
-			ExecEndIndexScan((IndexScanState *) node);
-			break;
-
-		case T_IndexOnlyScanState:
-			ExecEndIndexOnlyScan((IndexOnlyScanState *) node);
-			break;
-
-		case T_BitmapIndexScanState:
-			ExecEndBitmapIndexScan((BitmapIndexScanState *) node);
-			break;
-
-		case T_BitmapHeapScanState:
-			ExecEndBitmapHeapScan((BitmapHeapScanState *) node);
-			break;
-
-		case T_TidScanState:
-			ExecEndTidScan((TidScanState *) node);
-			break;
-
-		case T_SubqueryScanState:
-			ExecEndSubqueryScan((SubqueryScanState *) node);
-			break;
-
-		case T_FunctionScanState:
-			ExecEndFunctionScan((FunctionScanState *) node);
-			break;
-
-		case T_TableFuncScanState:
-			ExecEndTableFuncScan((TableFuncScanState *) node);
-			break;
-
-		case T_ValuesScanState:
-			ExecEndValuesScan((ValuesScanState *) node);
-			break;
-
-		case T_CteScanState:
-			ExecEndCteScan((CteScanState *) node);
-			break;
-
-		case T_WorkTableScanState:
-			ExecEndWorkTableScan((WorkTableScanState *) node);
-			break;
-
-		case T_ForeignScanState:
-			ExecEndForeignScan((ForeignScanState *) node);
-			break;
-
-		case T_CustomScanState:
-			ExecEndCustomScan((CustomScanState *) node);
-			break;
-
-			/*
-			 * join nodes
-			 */
-		case T_NestLoopState:
-			ExecEndNestLoop((NestLoopState *) node);
-			break;
-
-		case T_MergeJoinState:
-			ExecEndMergeJoin((MergeJoinState *) node);
-			break;
-
-		case T_HashJoinState:
-			ExecEndHashJoin((HashJoinState *) node);
-			break;
-
-			/*
-			 * materialization nodes
-			 */
-		case T_MaterialState:
-			ExecEndMaterial((MaterialState *) node);
-			break;
-
-		case T_SortState:
-			ExecEndSort((SortState *) node);
-			break;
-
-		case T_GroupState:
-			ExecEndGroup((GroupState *) node);
-			break;
-
-		case T_AggState:
-			ExecEndAgg((AggState *) node);
-			break;
-
-		case T_WindowAggState:
-			ExecEndWindowAgg((WindowAggState *) node);
-			break;
-
-		case T_UniqueState:
-			ExecEndUnique((UniqueState *) node);
-			break;
-
-		case T_HashState:
-			ExecEndHash((HashState *) node);
-			break;
-
-		case T_SetOpState:
-			ExecEndSetOp((SetOpState *) node);
-			break;
-
-		case T_LockRowsState:
-			ExecEndLockRows((LockRowsState *) node);
-			break;
-
-		case T_LimitState:
-			ExecEndLimit((LimitState *) node);
-			break;
-
 		default:
-			elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
+			elog(ERROR, "unrecognized/unsupported node type: %d",
+				 (int) nodeTag(node));
 			break;
 	}
 }
-- 
2.11.0

0003-Base-for-reversed-executor.patch (text/x-diff)
From 372629e30dfb40f357b46f3644e090de945e0cb0 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Fri, 10 Mar 2017 17:26:26 +0300
Subject: [PATCH 3/8] Base for reversed executor.

Framework for implementing reversed executor. Substitutes ExecutePlan call
with RunNode, which invokes ExecLeaf on leaf nodes in the proper order.

See README and the beginning of execProcnode.c for the details.
---
 src/backend/executor/README         |  47 +++++++
 src/backend/executor/execMain.c     | 273 +++++++++++++++++-------------------
 src/backend/executor/execProcnode.c | 152 ++++++++++++++++----
 src/include/executor/executor.h     |   4 +
 src/include/nodes/execnodes.h       |  11 ++
 5 files changed, 315 insertions(+), 172 deletions(-)

diff --git a/src/backend/executor/README b/src/backend/executor/README
index f1d1e4c76c..08f139d0c5 100644
--- a/src/backend/executor/README
+++ b/src/backend/executor/README
@@ -3,6 +3,53 @@ src/backend/executor/README
 The Postgres Executor
 =====================
 
+This is an attempt to implement a proof of concept of an executor with a
+push-based architecture like in [1]. We will call it the 'reversed' executor.
+For now we will not support both the reversed and the original executor at
+once, since that would involve a lot of copy-pasting (or time to avoid it);
+our current goal is just a working proof of concept to estimate the benefits.
+
+Since this is a huge change, we need to outline the general strategy: what we
+will start with and how we will deal with the old code, remembering that we
+will reuse a great deal of it.
+
+Key points:
+* ExecProcNode is now a stub. All per-node code (ExecSomeNode, etc.) is
+  unreachable. However, we keep it to avoid a 19k-line removal commit and to
+  produce more useful diffs later; a lot of this code will be reused.
+* The base for implementing the push model, common to all nodes, is in
+  execMain.c and execProcnode.c. We substitute ExecProcNode with the functions
+  ExecLeaf, ExecPushTuple and ExecPushNull, whose interface is described in
+  the comments at their definitions and at the start of execProcnode.c. This
+  is the only change to the nodes' interface. We also adapt execMain.c, namely
+  ExecutorRun, to run the nodes in the proper order from the bottom up.
+* Then we are ready to implement the nodes one by one.
+
+At first,
+* parallel execution will not be supported.
+* subplans will not be supported.
+* ExecReScan will not be supported for now.
+* only the CMD_SELECT operation will be supported.
+* only the forward direction will be supported.
+* set-returning functions will not be supported either.
+
+In general, we try to treat the old code as follows:
+* As said above, leave it if it is dead and not yet rewritten. Compile with
+  -Wno-unused-function if the warnings are annoying.
+* If it is not dead, but not yet updated for the reversed executor, remove it.
+  An example is the contents of ExecInitNode.
+* Sometimes we need to make minimal changes to an existing function, but these
+  changes will make it incompatible with existing code that is not yet
+  reworked.  In that case, to avoid deleting a lot of code, we just copy-paste
+  it until a more generic solution is provided. An example is
+  heapgettup_pagemode and its 'reversed' analogue added for seqscan.
+
+
+[1] Efficiently Compiling Efficient Query Plans for Modern Hardware,
+    http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
+
+The original README text follows below.
+
 The executor processes a tree of "plan nodes".  The plan tree is essentially
 a demand-pull pipeline of tuple processing operations.  Each node, when
 called, will produce the next tuple in its output sequence, or NULL if no
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index a465d74eab..ec8aba660d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -63,6 +63,7 @@
 #include "utils/ruleutils.h"
 #include "utils/snapmgr.h"
 #include "utils/tqual.h"
+#include "executor/executor.h"
 
 
 /* Hooks for plugins to get control in ExecutorStart/Run/Finish/End */
@@ -79,14 +80,7 @@ static void InitPlan(QueryDesc *queryDesc, int eflags);
 static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
 static void ExecPostprocessPlan(EState *estate);
 static void ExecEndPlan(PlanState *planstate, EState *estate);
-static void ExecutePlan(EState *estate, PlanState *planstate,
-			bool use_parallel_mode,
-			CmdType operation,
-			bool sendTuples,
-			uint64 numberTuples,
-			ScanDirection direction,
-			DestReceiver *dest,
-			bool execute_once);
+static void RunNode(PlanState *planstate);
 static bool ExecCheckRTEPerms(RangeTblEntry *rte);
 static bool ExecCheckRTEPermsModified(Oid relOid, Oid userid,
 						  Bitmapset *modifiedCols,
@@ -329,6 +323,11 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	 * extract information from the query descriptor and the query feature.
 	 */
 	operation = queryDesc->operation;
+	if (operation != CMD_SELECT)
+	{
+		elog(ERROR, "Non CMD_SELECT operations are not implemented");
+		return;
+	}
 	dest = queryDesc->dest;
 
 	/*
@@ -343,25 +342,24 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	if (sendTuples)
 		(*dest->rStartup) (dest, operation, queryDesc->tupDesc);
 
+	/* set up state needed for sending tuples to the dest */
+	estate->es_current_tuple_count = 0;
+	estate->es_sendTuples = sendTuples;
+	estate->es_numberTuplesRequested = count;
+	estate->es_operation = operation;
+	estate->es_dest = dest;
+
+	/*
+	 * Set the direction.
+	 */
+	estate->es_direction = direction;
+
 	/*
 	 * run plan
 	 */
 	if (!ScanDirectionIsNoMovement(direction))
-	{
-		if (execute_once && queryDesc->already_executed)
-			elog(ERROR, "can't re-execute query flagged for single execution");
-		queryDesc->already_executed = true;
-
-		ExecutePlan(estate,
-					queryDesc->planstate,
-					queryDesc->plannedstmt->parallelModeNeeded,
-					operation,
-					sendTuples,
-					count,
-					direction,
-					dest,
-					execute_once);
-	}
+		/* Run each leaf in the right order */
+		RunNode(queryDesc->planstate);
 
 	/*
 	 * shutdown tuple receiver, if we started it
@@ -1562,131 +1560,6 @@ ExecEndPlan(PlanState *planstate, EState *estate)
 	}
 }
 
-/* ----------------------------------------------------------------
- *		ExecutePlan
- *
- *		Processes the query plan until we have retrieved 'numberTuples' tuples,
- *		moving in the specified direction.
- *
- *		Runs to completion if numberTuples is 0
- *
- * Note: the ctid attribute is a 'junk' attribute that is removed before the
- * user can see it
- * ----------------------------------------------------------------
- */
-static void
-ExecutePlan(EState *estate,
-			PlanState *planstate,
-			bool use_parallel_mode,
-			CmdType operation,
-			bool sendTuples,
-			uint64 numberTuples,
-			ScanDirection direction,
-			DestReceiver *dest,
-			bool execute_once)
-{
-	TupleTableSlot *slot;
-	uint64		current_tuple_count;
-
-	/*
-	 * initialize local variables
-	 */
-	current_tuple_count = 0;
-
-	/*
-	 * Set the direction.
-	 */
-	estate->es_direction = direction;
-
-	/*
-	 * If the plan might potentially be executed multiple times, we must force
-	 * it to run without parallelism, because we might exit early.  Also
-	 * disable parallelism when writing into a relation, because no database
-	 * changes are allowed in parallel mode.
-	 */
-	if (!execute_once || dest->mydest == DestIntoRel)
-		use_parallel_mode = false;
-
-	if (use_parallel_mode)
-		EnterParallelMode();
-
-	/*
-	 * Loop until we've processed the proper number of tuples from the plan.
-	 */
-	for (;;)
-	{
-		/* Reset the per-output-tuple exprcontext */
-		ResetPerTupleExprContext(estate);
-
-		/*
-		 * Execute the plan and obtain a tuple
-		 */
-		slot = ExecProcNode(planstate);
-
-		/*
-		 * if the tuple is null, then we assume there is nothing more to
-		 * process so we just end the loop...
-		 */
-		if (TupIsNull(slot))
-		{
-			/* Allow nodes to release or shut down resources. */
-			(void) ExecShutdownNode(planstate);
-			break;
-		}
-
-		/*
-		 * If we have a junk filter, then project a new tuple with the junk
-		 * removed.
-		 *
-		 * Store this new "clean" tuple in the junkfilter's resultSlot.
-		 * (Formerly, we stored it back over the "dirty" tuple, which is WRONG
-		 * because that tuple slot has the wrong descriptor.)
-		 */
-		if (estate->es_junkFilter != NULL)
-			slot = ExecFilterJunk(estate->es_junkFilter, slot);
-
-		/*
-		 * If we are supposed to send the tuple somewhere, do so. (In
-		 * practice, this is probably always the case at this point.)
-		 */
-		if (sendTuples)
-		{
-			/*
-			 * If we are not able to send the tuple, we assume the destination
-			 * has closed and no more tuples can be sent. If that's the case,
-			 * end the loop.
-			 */
-			if (!((*dest->receiveSlot) (slot, dest)))
-				break;
-		}
-
-		/*
-		 * Count tuples processed, if this is a SELECT.  (For other operation
-		 * types, the ModifyTable plan node must count the appropriate
-		 * events.)
-		 */
-		if (operation == CMD_SELECT)
-			(estate->es_processed)++;
-
-		/*
-		 * check our tuple count.. if we've processed the proper number then
-		 * quit, else loop again and process more tuples.  Zero numberTuples
-		 * means no limit.
-		 */
-		current_tuple_count++;
-		if (numberTuples && numberTuples == current_tuple_count)
-		{
-			/* Allow nodes to release or shut down resources. */
-			(void) ExecShutdownNode(planstate);
-			break;
-		}
-	}
-
-	if (use_parallel_mode)
-		ExitParallelMode();
-}
-
-
 /*
  * ExecRelCheck --- check that tuple meets constraints for result relation
  *
@@ -3325,3 +3198,107 @@ ExecBuildSlotPartitionKeyDescription(Relation rel,
 
 	return buf.data;
 }
+
+/*
+ * This function pushes the ready tuple to its destination. It should
+ * be called by the top-level PlanState.
+ * For now, I added the state needed for this to estate, specifically
+ * current_tuple_count, sendTuples, numberTuplesRequested (old numberTuples),
+ * cmdType, dest.
+ *
+ * slot is the tuple to push
+ * planstate is the top-level node
+ * returns true if we are ready to accept more tuples, false otherwise
+ */
+bool
+SendReadyTuple(TupleTableSlot *slot, PlanState *planstate)
+{
+	EState *estate;
+	bool sendTuples;
+	CmdType operation;
+	DestReceiver *dest;
+
+	estate = planstate->state;
+	sendTuples = estate->es_sendTuples;
+	operation = estate->es_operation;
+	dest = estate->es_dest;
+
+	if (TupIsNull(slot))
+	{
+		/* Allow nodes to release or shut down resources. */
+		(void) ExecShutdownNode(planstate);
+		return false;
+	}
+
+	/*
+	 * If we have a junk filter, then project a new tuple with the junk
+	 * removed.
+	 *
+	 * Store this new "clean" tuple in the junkfilter's resultSlot.
+	 * (Formerly, we stored it back over the "dirty" tuple, which is WRONG
+	 * because that tuple slot has the wrong descriptor.)
+	 */
+	if (estate->es_junkFilter != NULL)
+		slot = ExecFilterJunk(estate->es_junkFilter, slot);
+
+	/*
+	 * If we are supposed to send the tuple somewhere, do so. (In
+	 * practice, this is probably always the case at this point.)
+	 */
+	if (sendTuples)
+	{
+		/*
+		 * If we are not able to send the tuple, we assume the destination
+		 * has closed and no more tuples can be sent.
+		 */
+		if (!((*dest->receiveSlot) (slot, dest)))
+			return false;
+	}
+
+	/*
+	 * Count tuples processed, if this is a SELECT.  (For other operation
+	 * types, the ModifyTable plan node must count the appropriate
+	 * events.)
+	 */
+	if (operation == CMD_SELECT)
+		(estate->es_processed)++;
+
+	/*
+	 * check our tuple count.. if we've processed the proper number then
+	 * quit, else process more tuples.  Zero numberTuplesRequested
+	 * means no limit.
+	 */
+	estate->es_current_tuple_count++;
+	if (estate->es_numberTuplesRequested &&
+		estate->es_numberTuplesRequested == estate->es_current_tuple_count)
+		return false;
+
+	ResetPerTupleExprContext(estate);
+	return true;
+}
+
+/*
+ * When pushing, we have to call pushTuple on each leaf of the tree in the
+ * correct order: inner sides first, then outer. This function does exactly that.
+ */
+void
+RunNode(PlanState *planstate)
+{
+	Assert(planstate != NULL);
+
+	if (innerPlanState(planstate) != NULL)
+	{
+		RunNode(innerPlanState(planstate));
+		/* I assume that if an inner node exists, the outer one exists too */
+		RunNode(outerPlanState(planstate));
+		return;
+	}
+	if (outerPlanState(planstate) != NULL)
+	{
+		RunNode(outerPlanState(planstate));
+		return;
+	}
+
+	/* node has no children; it is a leaf */
+	ExecLeaf(planstate);
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index b013a17023..5955f84d86 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -2,10 +2,10 @@
  *
  * execProcnode.c
  *	 contains dispatch functions which call the appropriate "initialize",
- *	 "get a tuple", and "cleanup" routines for the given node type.
- *	 If the node has children, then it will presumably call ExecInitNode,
- *	 ExecProcNode, or ExecEndNode on its subnodes and do the appropriate
- *	 processing.
+ *	 "push a tuple", and "cleanup" routines for the given node type.
+ *	 If the node has children, then it will presumably call ExecInitNode
+ *	 and ExecEndNode on its subnodes and ExecPushTuple to push processed tuple
+ *	 to its parent.
  *
  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -19,16 +19,18 @@
 /*
  *	 INTERFACE ROUTINES
  *		ExecInitNode	-		initialize a plan node and its subplans
- *		ExecProcNode	-		get a tuple by executing the plan node
+ *		ExecLeaf		-		start execution of the leaf
+ *		ExecPushTuple	-		push tuple to the parent node
+ *		ExecPushNull	-		let parent know that we are done
  *		ExecEndNode		-		shut down a plan node and its subplans
  *
  *	 NOTES
- *		This used to be three files.  It is now all combined into
- *		one file so that it is easier to keep ExecInitNode, ExecProcNode,
- *		and ExecEndNode in sync when new nodes are added.
+ *		This used to be three files. It is now all combined into
+ *		one file so that it is easier to keep ExecInitNode, ExecLeaf,
+ *		ExecPushTuple, ExecPushNull and ExecEndNode in sync when new nodes
+ *		are added.
  *
- *	 EXAMPLE
- *		Suppose we want the age of the manager of the shoe department and
+ *	 EXAMPLE: Suppose we want the age of the manager of the shoe department and
  *		the number of employees in that department.  So we have the query:
  *
  *				select DEPT.no_emps, EMP.age
@@ -56,24 +58,47 @@
  *		of ExecInitNode() is a plan state tree built with the same structure
  *		as the underlying plan tree.
  *
- *	  * Then when ExecutorRun() is called, it calls ExecutePlan() which calls
- *		ExecProcNode() repeatedly on the top node of the plan state tree.
- *		Each time this happens, ExecProcNode() will end up calling
- *		ExecNestLoop(), which calls ExecProcNode() on its subplans.
- *		Each of these subplans is a sequential scan so ExecSeqScan() is
- *		called.  The slots returned by ExecSeqScan() may contain
- *		tuples which contain the attributes ExecNestLoop() uses to
- *		form the tuples it returns.
+ *	  * Then when ExecutorRun() is called, it calls ExecLeaf on each leaf of
+ *		the plan state tree, inner leaves first, outer second. So, in this
+ *		case it will call it on the DEPT SeqScan, and then on the EMP
+ *		SeqScan. ExecLeaf chooses the corresponding implementation -- here it
+ *		is ExecSeqScan, sequential scan. ExecSeqScan retrieves tuples and for
+ *		each of them calls ExecPushTuple to pass the tuple to nodeSeqScan's
+ *		parent, a Nest Loop in this case. ExecPushTuple resolves the call,
+ *		i.e. it finds something like ExecPushTupleToNestLoopFromOuter and
+ *		calls it. Then the process repeats, so ExecPushTuple is called
+ *		recursively. We have two corner cases:
  *
- *	  * Eventually ExecSeqScan() stops returning tuples and the nest
- *		loop join ends.  Lastly, ExecutorEnd() calls ExecEndNode() which
+ *		1) When a node has nothing more to push, e.g. nodeSeqScan has
+ *		   scanned all the tuples. Then it calls ExecPushNull once to let its
+ *		   parent know that it has finished its work.
+ *		2) When a node has no parent (top-level node). In this case
+ *		   ExecPushTuple calls SendReadyTuple, which sends the tuple to its
+ *		   final destination.
+ *
+ *		So, in our example the DEPT ExecSeqScan will eventually call
+ *		ExecPushNull, so the Nest Loop node learns that its inner side is
+ *		done. Then the EMP SeqScan starts pushing, and inside the EMP
+ *		SeqScan's ExecPushTuple the Nest Loop matches the tuples and
+ *		pushes them to the final destination. Eventually the EMP SeqScan
+ *		calls ExecPushNull, the Nest Loop does the same, and the join ends.
+ *
+ *		ExecPushTuple returns a bool value telling whether the parent still
+ *		accepts tuples. This allows execution to stop in the middle; e.g. if
+ *		we have a Limit node above a SeqScan node, the latter needs to scan
+ *		only LIMIT tuples. We don't push anything after receiving false from
+ *		ExecPushTuple; obviously, even if we have pushed all the tuples and
+ *		the last ExecPushTuple call returned false, we don't call
+ *		ExecPushNull.
+ *
+ *		Lastly, ExecutorEnd() calls ExecEndNode() which
  *		calls ExecEndNestLoop() which in turn calls ExecEndNode() on
  *		its subplans which result in ExecEndSeqScan().
  *
- *		This should show how the executor works by having
- *		ExecInitNode(), ExecProcNode() and ExecEndNode() dispatch
- *		their work to the appropriate node support routines which may
- *		in turn call these routines themselves on their subplans.
+ *		This should show how the executor works by having ExecInitNode(),
+ *		ExecLeaf, ExecPushTuple, ExecPushNull and ExecEndNode() dispatch their
+ *		work to the appropriate node support routines which may in turn call
+ *		these routines themselves on their subplans.
  */
 #include "postgres.h"
 
@@ -173,6 +198,85 @@ ExecProcNode(PlanState *node)
 	return NULL;
 }
 
+/*
+ * Tell the 'node' leaf to start the execution
+ */
+void
+ExecLeaf(PlanState *node)
+{
+	CHECK_FOR_INTERRUPTS();
+
+	switch (nodeTag(node))
+	{
+		default:
+			elog(ERROR, "bottom node type not supported: %d",
+				 (int) nodeTag(node));
+	}
+}
+
+/*
+ * Instead of ExecProcNode, here we have the function ExecPushTuple, which
+ * pushes one tuple.
+ * 'slot' is the tuple to push; it must not be null. When a node has
+ * finished its work it must call ExecPushNull instead.
+ * 'pusher' is the sender of the tuple; its parent is the receiver. We take
+ * it as a param instead of its parent directly because we need it to
+ * distinguish inner and outer pushes.
+ *
+ * Returns true if the receiving node is still accepting tuples, false if not.
+ *
+ * If a tuple is pushed into a node which returned 'false' before, the
+ * behaviour is undefined, i.e. it is not allowed; we will try to catch such
+ * situations with asserts.
+ */
+bool
+ExecPushTuple(TupleTableSlot *slot, PlanState *pusher)
+{
+	PlanState *receiver = pusher->parent;
+
+	Assert(!TupIsNull(slot));
+
+	CHECK_FOR_INTERRUPTS();
+
+	/* If the receiver is NULL, then pusher is the top-level node, so we
+	 * send the tuple to the dest.
+	 */
+	if (receiver == NULL)
+	{
+		return SendReadyTuple(slot, pusher);
+	}
+
+	elog(ERROR, "node type not supported: %d", (int) nodeTag(receiver));
+}
+
+/*
+ * Signal the parent that we are done. Like in ExecPushTuple, sender is param
+ * here because we need to distinguish inner and outer pushes.
+ *
+ * 'slot' must be null tuple. It exists to be able to transfer correct
+ * tupleDesc.
+ */
+void
+ExecPushNull(TupleTableSlot *slot, PlanState *pusher)
+{
+	PlanState *receiver = pusher->parent;
+
+	Assert(TupIsNull(slot));
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * If the receiver is NULL, then pusher is the top-level node; end of
+	 * the execution.
+	 */
+	if (receiver == NULL)
+	{
+		SendReadyTuple(NULL, pusher);
+	}
+
+	elog(ERROR, "node type not supported: %d", (int) nodeTag(receiver));
+}
+
 /* ----------------------------------------------------------------
  * Unsupported too; we don't need it in push model
  * ----------------------------------------------------------------
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 716fa9dc27..386fcb4c8b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -180,6 +180,7 @@ extern void ExecutorRun(QueryDesc *queryDesc,
 			ScanDirection direction, uint64 count, bool execute_once);
 extern void standard_ExecutorRun(QueryDesc *queryDesc,
 					 ScanDirection direction, uint64 count, bool execute_once);
+extern bool SendReadyTuple(TupleTableSlot *slot, PlanState *planstate);
 extern void ExecutorFinish(QueryDesc *queryDesc);
 extern void standard_ExecutorFinish(QueryDesc *queryDesc);
 extern void ExecutorEnd(QueryDesc *queryDesc);
@@ -241,6 +242,9 @@ extern TupleTableSlot *ExecProcNode(PlanState *node);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern bool ExecShutdownNode(PlanState *node);
+extern void ExecLeaf(PlanState *node);
+extern bool ExecPushTuple(TupleTableSlot *slot, PlanState *pusher);
+extern void ExecPushNull(TupleTableSlot *slot, PlanState *pusher);
 
 /*
  * prototypes from functions in execQual.c
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 738f098b00..da7fd9c7ac 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -28,6 +28,7 @@
 #include "utils/tuplesort.h"
 #include "nodes/tidbitmap.h"
 #include "storage/condition_variable.h"
+#include "tcop/dest.h" /* for DestReceiver type in EState */
 
 
 /* ----------------
@@ -416,6 +417,16 @@ typedef struct EState
 	List	   *es_auxmodifytables;		/* List of secondary ModifyTableStates */
 
 	/*
+	 * State needed to push tuples to the dest in the push model; technically
+	 * these are local variables from the old ExecutePlan.
+	 */
+	uint64		es_current_tuple_count;
+	bool		es_sendTuples;
+	uint64		es_numberTuplesRequested;
+	CmdType		es_operation;
+	DestReceiver *es_dest;
+
+	/*
 	 * this ExprContext is for per-output-tuple operations, such as constraint
 	 * checks and index-value computations.  It will be reset for each output
 	 * tuple.  Note that it will be created only if needed.
-- 
2.11.0
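The control-flow inversion that this patch introduces can be modeled in miniature: the leaf drives the scan loop itself and stops as soon as its parent's boolean answer says "enough", which is exactly the role the bool return of ExecPushTuple plays above. This is a hypothetical, self-contained toy, not the patch's actual API; all names (Node, limit_push, scan_leaf, run_limited_sum) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A push-model "plan node": the child calls push() on its parent. */
typedef struct Node Node;
struct Node
{
	Node	   *parent;
	/* push one value up; returns true while the parent accepts more */
	bool		(*push) (Node *self, int value);
	/* state for the toy "limit + sum" parent node */
	int			limit;
	int			seen;
	int			sum;
};

/* Parent node: sums values, refuses further input after 'limit' tuples. */
static bool
limit_push(Node *self, int value)
{
	self->sum += value;
	self->seen++;
	return self->seen < self->limit;	/* false => stop pushing */
}

/* Leaf node: analogue of ExecSeqScan/heappushtups, it owns the loop. */
static void
scan_leaf(Node *leaf, const int *vals, int nvals)
{
	for (int i = 0; i < nvals; i++)
		if (!leaf->parent->push(leaf->parent, vals[i]))
			return;				/* parent said "enough": early termination */
}

/* Wire a leaf under a limit node and run the tiny "query". */
static int
run_limited_sum(const int *vals, int nvals, int limit)
{
	Node		parent = {NULL, limit_push, limit, 0, 0};
	Node		leaf = {&parent, NULL, 0, 0, 0};

	scan_leaf(&leaf, vals, nvals);
	return parent.sum;
}
```

With a limit of 3 over {1, 2, 3, 4, 5} the leaf pushes only three values and the scan stops mid-array, without the parent ever "pulling" a tuple.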

0004-Reversed-SeqScan-implementation.patch (text/x-diff)
From 9543495bae9d486a81c15087b97bc1d4a1c631df Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Fri, 10 Mar 2017 21:52:01 +0300
Subject: [PATCH 4/8] Reversed SeqScan implementation.

The main job is done by the heappushtups func, which iterates over tuples and
pushes each. It is mostly copied from heapgettup_pagemode, which is left for
compatibility.

Per-tuple handling (checking quals, etc.) is implemented as inline functions
in nodeSeqscan.h.

Since heapam.h must now know about PlanState, some forward decls were added,
which is kind of ugly.

EvalPlanQual is not supported.
---
 src/backend/access/heap/heapam.c    | 255 ++++++++++++++++++++++++++++++++++++
 src/backend/executor/execProcnode.c |  23 +++-
 src/backend/executor/nodeSeqscan.c  |  75 +++--------
 src/include/access/heapam.h         |   8 ++
 src/include/executor/nodeSeqscan.h  | 121 ++++++++++++++++-
 5 files changed, 424 insertions(+), 58 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b147f6482c..b5ce8aff10 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -73,6 +73,8 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "executor/executor.h"
+#include "executor/nodeSeqscan.h"
 
 
 /* GUC variable */
@@ -9236,3 +9238,256 @@ heap_mask(char *pagedata, BlockNumber blkno)
 		}
 	}
 }
+
+/* ----------------
+ * Fetch tuples, check quals and push them. A modified heapgettup_pagemode,
+ * with a lot of copy-pasting.
+ * This function in fact doesn't care about the pusher type and func,
+ * although SeqScanState and the inlined SeqPushHeapTuple are hardcoded for now.
+ * ----------------
+ */
+void
+heappushtups(HeapScanDesc scan,
+			 ScanDirection dir,
+			 int nkeys,
+			 ScanKey key,
+			 SeqScanState *pusher)
+{
+	HeapTuple	tuple = &(scan->rs_ctup);
+	bool		backward = ScanDirectionIsBackward(dir);
+	BlockNumber page;
+	bool		finished;
+	Page		dp;
+	int			lines;
+	int			lineindex;
+	OffsetNumber lineoff;
+	int			linesleft;
+	ItemId		lpp;
+
+	/* no movement is not supported for now */
+	Assert(!ScanDirectionIsNoMovement(dir));
+
+	/*
+	 * calculate next starting lineindex, given scan direction
+	 */
+	if (ScanDirectionIsForward(dir))
+	{
+		if (!scan->rs_inited)
+		{
+			/*
+			 * return null immediately if relation is empty
+			 */
+			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
+			{
+				Assert(!BufferIsValid(scan->rs_cbuf));
+				tuple->t_data = NULL;
+				SeqPushNull(pusher);
+				return;
+			}
+			if (scan->rs_parallel != NULL)
+			{
+				page = heap_parallelscan_nextpage(scan);
+
+				/* Other processes might have already finished the scan. */
+				if (page == InvalidBlockNumber)
+				{
+					Assert(!BufferIsValid(scan->rs_cbuf));
+					tuple->t_data = NULL;
+					SeqPushNull(pusher);
+					return;
+				}
+			}
+			else
+				page = scan->rs_startblock;		/* first page */
+			heapgetpage(scan, page);
+			lineindex = 0;
+			scan->rs_inited = true;
+		}
+		else
+		{
+			/* continue from previously returned page/tuple */
+			page = scan->rs_cblock;		/* current page */
+			lineindex = scan->rs_cindex + 1;
+		}
+
+		dp = BufferGetPage(scan->rs_cbuf);
+		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
+		lines = scan->rs_ntuples;
+		/* page and lineindex now reference the next visible tid */
+
+		linesleft = lines - lineindex;
+	}
+	else /* backward */
+	{
+		/* backward parallel scan not supported */
+		Assert(scan->rs_parallel == NULL);
+
+		if (!scan->rs_inited)
+		{
+			/*
+			 * return null immediately if relation is empty
+			 */
+			if (scan->rs_nblocks == 0 || scan->rs_numblocks == 0)
+			{
+				Assert(!BufferIsValid(scan->rs_cbuf));
+				tuple->t_data = NULL;
+				SeqPushNull(pusher);
+				return;
+			}
+
+			/*
+			 * Disable reporting to syncscan logic in a backwards scan; it's
+			 * not very likely anyone else is doing the same thing at the same
+			 * time, and much more likely that we'll just bollix things for
+			 * forward scanners.
+			 */
+			scan->rs_syncscan = false;
+			/* start from last page of the scan */
+			if (scan->rs_startblock > 0)
+				page = scan->rs_startblock - 1;
+			else
+				page = scan->rs_nblocks - 1;
+			heapgetpage(scan, page);
+		}
+		else
+		{
+			/* continue from previously returned page/tuple */
+			page = scan->rs_cblock;		/* current page */
+		}
+
+		dp = BufferGetPage(scan->rs_cbuf);
+		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
+		lines = scan->rs_ntuples;
+
+		if (!scan->rs_inited)
+		{
+			lineindex = lines - 1;
+			scan->rs_inited = true;
+		}
+		else
+		{
+			lineindex = scan->rs_cindex - 1;
+		}
+		/* page and lineindex now reference the previous visible tid */
+
+		linesleft = lineindex + 1;
+	}
+
+	/*
+	 * advance the scan until we find a qualifying tuple or run out of stuff
+	 * to scan
+	 */
+	for (;;)
+	{
+		while (linesleft > 0)
+		{
+			bool tuple_qualifies = false;
+
+			lineoff = scan->rs_vistuples[lineindex];
+			lpp = PageGetItemId(dp, lineoff);
+			Assert(ItemIdIsNormal(lpp));
+
+			tuple->t_data = (HeapTupleHeader) PageGetItem((Page) dp, lpp);
+			tuple->t_len = ItemIdGetLength(lpp);
+			ItemPointerSet(&(tuple->t_self), page, lineoff);
+
+			/*
+			 * if current tuple qualifies, push it.
+			 */
+			if (key != NULL)
+			{
+				HeapKeyTest(tuple, RelationGetDescr(scan->rs_rd),
+							nkeys, key, tuple_qualifies);
+			}
+			else
+			{
+				tuple_qualifies = true;
+			}
+
+			if (tuple_qualifies)
+			{
+				/* Push tuple */
+				scan->rs_cindex = lineindex;
+				pgstat_count_heap_getnext(scan->rs_rd);
+				if (!SeqPushHeapTuple(tuple, pusher))
+					return;
+			}
+
+			/*
+			 * and carry on to the next one anyway
+			 */
+			--linesleft;
+			if (backward)
+				--lineindex;
+			else
+				++lineindex;
+		}
+
+		/*
+		 * if we get here, it means we've exhausted the items on this page and
+		 * it's time to move to the next.
+		 */
+		if (backward)
+		{
+			finished = (page == scan->rs_startblock) ||
+				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+			if (page == 0)
+				page = scan->rs_nblocks;
+			page--;
+		}
+		else if (scan->rs_parallel != NULL)
+		{
+			page = heap_parallelscan_nextpage(scan);
+			finished = (page == InvalidBlockNumber);
+		}
+		else
+		{
+			page++;
+			if (page >= scan->rs_nblocks)
+				page = 0;
+			finished = (page == scan->rs_startblock) ||
+				(scan->rs_numblocks != InvalidBlockNumber ? --scan->rs_numblocks == 0 : false);
+
+			/*
+			 * Report our new scan position for synchronization purposes. We
+			 * don't do that when moving backwards, however. That would just
+			 * mess up any other forward-moving scanners.
+			 *
+			 * Note: we do this before checking for end of scan so that the
+			 * final state of the position hint is back at the start of the
+			 * rel.  That's not strictly necessary, but otherwise when you run
+			 * the same query multiple times the starting position would shift
+			 * a little bit backwards on every invocation, which is confusing.
+			 * We don't guarantee any specific ordering in general, though.
+			 */
+			if (scan->rs_syncscan)
+				ss_report_location(scan->rs_rd, page);
+		}
+
+		/*
+		 * return NULL if we've exhausted all the pages
+		 */
+		if (finished)
+		{
+			if (BufferIsValid(scan->rs_cbuf))
+				ReleaseBuffer(scan->rs_cbuf);
+			scan->rs_cbuf = InvalidBuffer;
+			scan->rs_cblock = InvalidBlockNumber;
+			tuple->t_data = NULL;
+			scan->rs_inited = false;
+			SeqPushNull(pusher);
+			return;
+		}
+
+		heapgetpage(scan, page);
+
+		dp = BufferGetPage(scan->rs_cbuf);
+		TestForOldSnapshot(scan->rs_snapshot, scan->rs_rd, dp);
+		lines = scan->rs_ntuples;
+		linesleft = lines;
+		if (backward)
+			lineindex = lines - 1;
+		else
+			lineindex = 0;
+	}
+}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5955f84d86..1b81e30cd3 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -174,6 +174,13 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 
 	switch (nodeTag(node))
 	{
+		/*
+		 * scan nodes
+		 */
+		case T_SeqScan:
+			result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
+												   estate, eflags, parent);
+			break;
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
@@ -208,6 +215,10 @@ ExecLeaf(PlanState *node)
 
 	switch (nodeTag(node))
 	{
+		case T_SeqScanState:
+			ExecSeqScan((SeqScanState *) node);
+			break;
+
 		default:
 			elog(ERROR, "bottom node type not supported: %d",
 				 (int) nodeTag(node));
@@ -250,7 +261,7 @@ ExecPushTuple(TupleTableSlot *slot, PlanState *pusher)
 }
 
 /*
- * Signal the parent that we are done. Like in ExecPushTuple, sender is param
+ * Signal parent that we are done. Like in ExecPushTuple, sender is param
  * here because we need to distinguish inner and outer pushes.
  *
  * 'slot' must be null tuple. It exists to be able to transfer correct
@@ -271,7 +282,8 @@ ExecPushNull(TupleTableSlot *slot, PlanState *pusher)
 	 */
 	if (receiver == NULL)
 	{
-		SendReadyTuple(NULL, pusher);
+		SendReadyTuple(slot, pusher);
+		return;
 	}
 
 	elog(ERROR, "node type not supported: %d", (int) nodeTag(receiver));
@@ -316,6 +328,13 @@ ExecEndNode(PlanState *node)
 
 	switch (nodeTag(node))
 	{
+		/*
+		 * scan nodes
+		 */
+		case T_SeqScanState:
+			ExecEndSeqScan((SeqScanState *) node);
+			break;
+
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index e61895de0a..babd8f07b1 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -15,7 +15,7 @@
 /*
  * INTERFACE ROUTINES
  *		ExecSeqScan				sequentially scans a relation.
- *		ExecSeqNext				retrieve next tuple in sequential order.
+ *		pushTupleToSeqScan		pushes all tuples to the parent node
  *		ExecInitSeqScan			creates and initializes a seqscan node.
  *		ExecEndSeqScan			releases any storage allocated.
  *		ExecReScanSeqScan		rescans the relation
@@ -30,29 +30,25 @@
 #include "executor/execdebug.h"
 #include "executor/nodeSeqscan.h"
 #include "utils/rel.h"
+#include "access/heapam.h"
 
 static void InitScanRelation(SeqScanState *node, EState *estate, int eflags);
-static TupleTableSlot *SeqNext(SeqScanState *node);
 
 /* ----------------------------------------------------------------
  *						Scan Support
  * ----------------------------------------------------------------
  */
 
-/* ----------------------------------------------------------------
- *		SeqNext
- *
- *		This is a workhorse for ExecSeqScan
- * ----------------------------------------------------------------
+/*
+ * Push scanned tuples to the parent. Stop when all tuples have been pushed
+ * or the parent has told us to stop pushing.
  */
-static TupleTableSlot *
-SeqNext(SeqScanState *node)
+void
+ExecSeqScan(SeqScanState *node)
 {
-	HeapTuple	tuple;
-	HeapScanDesc scandesc;
 	EState	   *estate;
+	HeapScanDesc scandesc;
 	ScanDirection direction;
-	TupleTableSlot *slot;
 
 	/*
 	 * get information from the estate and scan state
@@ -60,8 +56,11 @@ SeqNext(SeqScanState *node)
 	scandesc = node->ss.ss_currentScanDesc;
 	estate = node->ss.ps.state;
 	direction = estate->es_direction;
-	slot = node->ss.ss_ScanTupleSlot;
 
+	/* ExecScanFetch not implemented */
+	Assert(estate->es_epqTuple == NULL);
+
+	/* create scandesc; this is the part of old SeqNext before heap_getnext */
 	if (scandesc == NULL)
 	{
 		/*
@@ -73,30 +72,15 @@ SeqNext(SeqScanState *node)
 								  0, NULL);
 		node->ss.ss_currentScanDesc = scandesc;
 	}
+	Assert(scandesc);
 
-	/*
-	 * get the next tuple from the table
-	 */
-	tuple = heap_getnext(scandesc, direction);
+	/* non-page-at-a-time mode not supported for now */
+	Assert(scandesc->rs_pageatatime);
+	heappushtups(scandesc, direction,
+				 scandesc->rs_nkeys,
+				 scandesc->rs_key,
+				 node);
 
-	/*
-	 * save the tuple and the buffer returned to us by the access methods in
-	 * our scan tuple slot and return the slot.  Note: we pass 'false' because
-	 * tuples returned by heap_getnext() are pointers onto disk pages and were
-	 * not created with palloc() and so should not be pfree()'d.  Note also
-	 * that ExecStoreTuple will increment the refcount of the buffer; the
-	 * refcount will not be dropped until the tuple table slot is cleared.
-	 */
-	if (tuple)
-		ExecStoreTuple(tuple,	/* tuple to store */
-					   slot,	/* slot to store in */
-					   scandesc->rs_cbuf,		/* buffer associated with this
-												 * tuple */
-					   false);	/* don't pfree this pointer */
-	else
-		ExecClearTuple(slot);
-
-	return slot;
 }
 
 /*
@@ -113,23 +97,6 @@ SeqRecheck(SeqScanState *node, TupleTableSlot *slot)
 }
 
 /* ----------------------------------------------------------------
- *		ExecSeqScan(node)
- *
- *		Scans the relation sequentially and returns the next qualifying
- *		tuple.
- *		We call the ExecScan() routine and pass it the appropriate
- *		access method functions.
- * ----------------------------------------------------------------
- */
-TupleTableSlot *
-ExecSeqScan(SeqScanState *node)
-{
-	return ExecScan((ScanState *) node,
-					(ExecScanAccessMtd) SeqNext,
-					(ExecScanRecheckMtd) SeqRecheck);
-}
-
-/* ----------------------------------------------------------------
  *		InitScanRelation
  *
  *		Set up to access the scan relation.
@@ -154,13 +121,12 @@ InitScanRelation(SeqScanState *node, EState *estate, int eflags)
 	ExecAssignScanType(&node->ss, RelationGetDescr(currentRelation));
 }
 
-
 /* ----------------------------------------------------------------
  *		ExecInitSeqScan
  * ----------------------------------------------------------------
  */
 SeqScanState *
-ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
+ExecInitSeqScan(SeqScan *node, EState *estate, int eflags, PlanState *parent)
 {
 	SeqScanState *scanstate;
 
@@ -177,6 +143,7 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 	scanstate = makeNode(SeqScanState);
 	scanstate->ss.ps.plan = (Plan *) node;
 	scanstate->ss.ps.state = estate;
+	scanstate->ss.ps.parent = parent;
 
 	/*
 	 * Miscellaneous initialization
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 7e85510d2f..74097ffe50 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -126,6 +126,14 @@ extern void heap_rescan_set_params(HeapScanDesc scan, ScanKey key,
 					 bool allow_strat, bool allow_sync, bool allow_pagemode);
 extern void heap_endscan(HeapScanDesc scan);
 extern HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);
+/* forward decls because now we need to know about PlanState  */
+typedef struct PlanState PlanState;
+typedef struct SeqScanState SeqScanState;
+extern void heappushtups(HeapScanDesc scan,
+						 ScanDirection dir,
+						 int nkeys,
+						 ScanKey key,
+						 SeqScanState *pusher);
 
 extern Size heap_parallelscan_estimate(Snapshot snapshot);
 extern void heap_parallelscan_initialize(ParallelHeapScanDesc target,
diff --git a/src/include/executor/nodeSeqscan.h b/src/include/executor/nodeSeqscan.h
index 92b305e138..f7d69296a9 100644
--- a/src/include/executor/nodeSeqscan.h
+++ b/src/include/executor/nodeSeqscan.h
@@ -15,10 +15,15 @@
 #define NODESEQSCAN_H
 
 #include "access/parallel.h"
+#include "access/relscan.h"
 #include "nodes/execnodes.h"
+#include "executor/executor.h"
+#include "utils/memutils.h"
+#include "miscadmin.h"
 
-extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSeqScan(SeqScanState *node);
+extern SeqScanState *ExecInitSeqScan(SeqScan *node, EState *estate, int eflags,
+									 PlanState *parent);
+extern void ExecSeqScan(SeqScanState *node);
 extern void ExecEndSeqScan(SeqScanState *node);
 extern void ExecReScanSeqScan(SeqScanState *node);
 
@@ -27,4 +32,116 @@ extern void ExecSeqScanEstimate(SeqScanState *node, ParallelContext *pcxt);
 extern void ExecSeqScanInitializeDSM(SeqScanState *node, ParallelContext *pcxt);
 extern void ExecSeqScanInitializeWorker(SeqScanState *node, shm_toc *toc);
 
+/* inline functions decls and implementations */
+static inline void SeqPushNull(SeqScanState *pusher);
+static inline bool SeqPushHeapTuple(HeapTuple tuple, SeqScanState *pusher);
+
+/* push NULL to the parent, signaling that we are done */
+static inline void
+SeqPushNull(SeqScanState *pusher)
+{
+	ProjectionInfo *projInfo;
+	TupleTableSlot *slot;
+
+	projInfo = pusher->ss.ps.ps_ProjInfo;
+	slot = pusher->ss.ss_ScanTupleSlot;
+
+	ExecClearTuple(slot);
+	/*
+	 * being careful to use the projection result slot so it has correct
+	 * tupleDesc.
+	 */
+	if (projInfo)
+		ExecPushNull(ExecClearTuple(projInfo->pi_slot), (PlanState *) pusher);
+	else
+		ExecPushNull(slot, (PlanState *) pusher);
+}
+
+/* Push a ready HeapTuple from SeqScanState
+ *
+ * Check the qual for the tuple and push it. The tuple must not be NULL.
+ * Returns true if the parent accepts more tuples, false otherwise.
+ */
+static inline bool SeqPushHeapTuple(HeapTuple tuple, SeqScanState *pusher)
+{
+	HeapScanDesc scandesc;
+	ExprContext *econtext;
+	List	   *qual;
+	ProjectionInfo *projInfo;
+	TupleTableSlot *slot;
+
+	Assert(tuple->t_data != NULL);
+
+	/*
+	 * Fetch data from node
+	 */
+	qual = pusher->ss.ps.qual;
+	projInfo = pusher->ss.ps.ps_ProjInfo;
+	econtext = pusher->ss.ps.ps_ExprContext;
+	scandesc = pusher->ss.ss_currentScanDesc;
+	slot = pusher->ss.ss_ScanTupleSlot;
+
+	CHECK_FOR_INTERRUPTS();
+
+	/*
+	 * save the tuple and the buffer returned to us by the access methods in
+	 * our scan tuple slot.	 Note: we pass 'false' because tuples returned by
+	 * heap_getnext() are pointers onto disk pages and were not created with
+	 * palloc() and so should not be pfree()'d.	 Note also that ExecStoreTuple
+	 * will increment the refcount of the buffer; the refcount will not be
+	 * dropped until the tuple table slot is cleared.
+	 */
+	ExecStoreTuple(tuple,	/* tuple to store */
+				   slot,	/* slot to store in */
+				   scandesc->rs_cbuf,		/* buffer associated with this
+											 * tuple */
+				   false);	/* don't pfree this pointer */
+
+	/*
+	 * If we have neither a qual to check nor a projection to do, just skip
+	 * all the overhead and push the raw scan tuple.
+	 */
+	if (!qual && !projInfo)
+	{
+		return ExecPushTuple(slot, (PlanState *) pusher);
+	}
+
+	ResetExprContext(econtext);
+	/*
+	 * place the current tuple into the expr context
+	 */
+	econtext->ecxt_scantuple = slot;
+
+	/*
+	 * check that the current tuple satisfies the qual-clause
+	 *
+	 * check for non-nil qual here to avoid a function call to ExecQual()
+	 * when the qual is nil ... saves only a few cycles, but they add up
+	 * ...
+	 */
+	if (!qual || ExecQual(qual, econtext, false))
+	{
+		/*
+		 * Found a satisfactory scan tuple.
+		 */
+		if (projInfo)
+		{
+			/*
+			 * Form a projection tuple, store it in the result tuple slot
+			 * and push
+			 */
+			return ExecPushTuple(ExecProject(projInfo), (PlanState *) pusher);
+		}
+		/*
+		 * Here, we aren't projecting, so just push scan tuple.
+		 */
+		return ExecPushTuple(slot, (PlanState *) pusher);
+	}
+	else
+		InstrCountFiltered1(pusher, 1);
+
+	return true;
+}
+
+
 #endif   /* NODESEQSCAN_H */
-- 
2.11.0

0005-Reversed-HashJoin-implementation.patch (text/x-diff)
From c8527127652a5c5914fbb2e7de5dbaa814ed6d94 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Sat, 11 Mar 2017 00:36:31 +0300
Subject: [PATCH 5/8] Reversed HashJoin implementation.

The main point here is that tuples are pushed immediately after the match,
i.e. we scan the whole bucket in one loop and pushTuple each match. The main
logic was rewritten without using a state machine.
---
 src/backend/executor/execProcnode.c |  56 +++
 src/backend/executor/nodeHash.c     | 185 +++++++---
 src/backend/executor/nodeHashjoin.c | 683 ++++++++++++++----------------------
 src/include/executor/nodeHash.h     |  11 +-
 src/include/executor/nodeHashjoin.h |  86 ++++-
 src/include/nodes/execnodes.h       |   2 +
 6 files changed, 556 insertions(+), 467 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 1b81e30cd3..f2275876d6 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -181,6 +181,22 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 			result = (PlanState *) ExecInitSeqScan((SeqScan *) node,
 												   estate, eflags, parent);
 			break;
+
+		/*
+		 * join nodes
+		 */
+		case T_HashJoin:
+			result = (PlanState *) ExecInitHashJoin((HashJoin *) node,
+													estate, eflags, parent);
+			break;
+
+		/*
+		 * materialization nodes
+		 */
+		case T_Hash:
+			result = (PlanState *) ExecInitHash((Hash *) node,
+												estate, eflags, parent);
+			break;
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
@@ -244,6 +260,7 @@ bool
 ExecPushTuple(TupleTableSlot *slot, PlanState *pusher)
 {
 	PlanState *receiver = pusher->parent;
+	bool push_from_outer;
 
 	Assert(!TupIsNull(slot));
 
@@ -257,6 +274,16 @@ ExecPushTuple(TupleTableSlot *slot, PlanState *pusher)
 		return SendReadyTuple(slot, pusher);
 	}
 
+	if (nodeTag(receiver) == T_HashState)
+		return ExecPushTupleToHash(slot, (HashState *) receiver);
+
+	/* does push come from the outer side? */
+	push_from_outer = outerPlanState(receiver) == pusher;
+
+	if (nodeTag(receiver) == T_HashJoinState && push_from_outer)
+		return ExecPushTupleToHashJoinFromOuter(slot,
+											   (HashJoinState *) receiver);
+
 	elog(ERROR, "node type not supported: %d", (int) nodeTag(receiver));
 }
 
@@ -271,6 +298,7 @@ void
 ExecPushNull(TupleTableSlot *slot, PlanState *pusher)
 {
 	PlanState *receiver = pusher->parent;
+	bool push_from_outer;
 
 	Assert(TupIsNull(slot));
 
@@ -286,6 +314,20 @@ ExecPushNull(TupleTableSlot *slot, PlanState *pusher)
 		return;
 	}
 
+	if (nodeTag(receiver) == T_HashState)
+		return ExecPushNullToHash(slot, (HashState *) receiver);
+
+	/* does push come from the outer side? */
+	push_from_outer = outerPlanState(receiver) == pusher;
+
+	if (nodeTag(receiver) == T_HashJoinState && push_from_outer)
+		return ExecPushNullToHashJoinFromOuter(slot,
+											   (HashJoinState *) receiver);
+
+	else if (nodeTag(receiver) == T_HashJoinState && !push_from_outer)
+		return ExecPushNullToHashJoinFromInner(slot,
+											   (HashJoinState *) receiver);
+
 	elog(ERROR, "node type not supported: %d", (int) nodeTag(receiver));
 }
 
@@ -335,6 +377,20 @@ ExecEndNode(PlanState *node)
 			ExecEndSeqScan((SeqScanState *) node);
 			break;
 
+		/*
+		 * join nodes
+		 */
+		case T_HashJoinState:
+			ExecEndHashJoin((HashJoinState *) node);
+			break;
+
+		/*
+		 * materialization nodes
+		 */
+		case T_HashState:
+			ExecEndHash((HashState *) node);
+			break;
+
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 43e65ca04e..e12fc180d8 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -50,17 +50,99 @@ static void ExecHashRemoveNextSkewBucket(HashJoinTable hashtable);
 
 static void *dense_alloc(HashJoinTable hashtable, Size size);
 
-/* ----------------------------------------------------------------
- *		ExecHash
- *
- *		stub for pro forma compliance
- * ----------------------------------------------------------------
+
+/*
+ * Put incoming tuples into the hashtable.
  */
-TupleTableSlot *
-ExecHash(HashState *node)
+bool
+ExecPushTupleToHash(TupleTableSlot *slot, HashState *node)
 {
-	elog(ERROR, "Hash node does not support ExecProcNode call convention");
-	return NULL;
+	List	   *hashkeys;
+	HashJoinTable hashtable;
+	ExprContext *econtext;
+	uint32		hashvalue;
+	HashJoinState *hj_node;
+
+	hj_node = (HashJoinState *) node->ps.parent;
+
+	/* Create the hashtable. In vanilla Postgres this code is in HashJoin */
+	if (node->first_time_through)
+	{
+		Assert(node->hashtable == NULL);
+
+		node->hashtable = ExecHashTableCreate((Hash *) node->ps.plan,
+											  hj_node->hj_HashOperators,
+											  HJ_FILL_INNER(hj_node));
+
+		/* must provide our own instrumentation support */
+		if (node->ps.instrument)
+			InstrStartNode(node->ps.instrument);
+
+		node->first_time_through = false;
+	}
+
+	/*
+	 * get state info from node
+	 */
+	hashtable = node->hashtable;
+
+	/*
+	 * set expression context
+	 */
+	hashkeys = node->hashkeys;
+	econtext = node->ps.ps_ExprContext;
+
+	/* We have to compute the hash value */
+	econtext->ecxt_innertuple = slot;
+	if (ExecHashGetHashValue(hashtable, econtext, hashkeys,
+							 false, hashtable->keepNulls,
+							 &hashvalue))
+	{
+		int			bucketNumber;
+
+		bucketNumber = ExecHashGetSkewBucket(hashtable, hashvalue);
+		if (bucketNumber != INVALID_SKEW_BUCKET_NO)
+		{
+			/* It's a skew tuple, so put it into that hash table */
+			ExecHashSkewTableInsert(hashtable, slot, hashvalue,
+									bucketNumber);
+			hashtable->skewTuples += 1;
+		}
+		else
+		{
+			/* Not subject to skew optimization, so insert normally */
+			ExecHashTableInsert(hashtable, slot, hashvalue);
+		}
+		hashtable->totalTuples += 1;
+	}
+
+	/* ready to accept another tuple */
+	return true;
+}
+
+/*
+ * NULL received: finalize building the hashtable and notify HashJoin about
+ * that.
+ */
+void
+ExecPushNullToHash(TupleTableSlot *slot, HashState *node)
+{
+	HashJoinTable hashtable = node->hashtable;
+
+	/* resize the hash table if needed (NTUP_PER_BUCKET exceeded) */
+	if (hashtable->nbuckets != hashtable->nbuckets_optimal)
+		ExecHashIncreaseNumBuckets(hashtable);
+
+	/* Account for the buckets in spaceUsed (reported in EXPLAIN ANALYZE) */
+	hashtable->spaceUsed += hashtable->nbuckets * sizeof(HashJoinTuple);
+	if (hashtable->spaceUsed > hashtable->spacePeak)
+		hashtable->spacePeak = hashtable->spaceUsed;
+
+	/* must provide our own instrumentation support */
+	if (node->ps.instrument)
+		InstrStopNode(node->ps.instrument, hashtable->totalTuples);
+
+	ExecPushNull(slot, ((PlanState *) node));
 }
 
 /* ----------------------------------------------------------------
@@ -159,7 +241,7 @@ MultiExecHash(HashState *node)
  * ----------------------------------------------------------------
  */
 HashState *
-ExecInitHash(Hash *node, EState *estate, int eflags)
+ExecInitHash(Hash *node, EState *estate, int eflags, PlanState *parent)
 {
 	HashState  *hashstate;
 
@@ -172,8 +254,10 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	hashstate = makeNode(HashState);
 	hashstate->ps.plan = (Plan *) node;
 	hashstate->ps.state = estate;
+	hashstate->ps.parent = parent;
 	hashstate->hashtable = NULL;
 	hashstate->hashkeys = NIL;	/* will be set by parent HashJoin */
+	hashstate->first_time_through = true;
 
 	/*
 	 * Miscellaneous initialization
@@ -201,7 +285,7 @@ ExecInitHash(Hash *node, EState *estate, int eflags)
 	 * initialize child nodes
 	 */
 	outerPlanState(hashstate) = ExecInitNode(outerPlan(node), estate, eflags,
-											 (PlanState*) hashstate);
+											 (PlanState *) hashstate);
 
 	/*
 	 * initialize tuple type. no need to initialize projection info because
@@ -1051,34 +1135,35 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 }
 
 /*
- * ExecScanHashBucket
- *		scan a hash bucket for matches to the current outer tuple
+ * ExecScanHashBucketAndPush
+ *		scan a hash bucket for matches to the current outer tuple and push
+ *		them
  *
  * The current outer tuple must be stored in econtext->ecxt_outertuple.
  *
- * On success, the inner tuple is stored into hjstate->hj_CurTuple and
- * econtext->ecxt_innertuple, using hjstate->hj_HashTupleSlot as the slot
- * for the latter.
+ * Returns true if the parent still accepts tuples, false otherwise.
  */
 bool
-ExecScanHashBucket(HashJoinState *hjstate,
-				   ExprContext *econtext)
+ExecScanHashBucketAndPush(HashJoinState *hjstate,
+						  ExprContext *econtext)
 {
 	List	   *hjclauses = hjstate->hashclauses;
 	HashJoinTable hashtable = hjstate->hj_HashTable;
-	HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+	HashJoinTuple hashTuple;
 	uint32		hashvalue = hjstate->hj_CurHashValue;
+	JoinType	jointype = hjstate->js.jointype;
+
+	/*
+	 * For now, we don't support pausing execution; we either push all matching
+	 * tuples from the bucket at once or don't touch it at all.
+	 */
+	Assert(hjstate->hj_CurTuple == NULL);
 
 	/*
-	 * hj_CurTuple is the address of the tuple last returned from the current
-	 * bucket, or NULL if it's time to start scanning a new bucket.
-	 *
 	 * If the tuple hashed to a skew bucket then scan the skew bucket
 	 * otherwise scan the standard hashtable bucket.
 	 */
-	if (hashTuple != NULL)
-		hashTuple = hashTuple->next;
-	else if (hjstate->hj_CurSkewBucketNo != INVALID_SKEW_BUCKET_NO)
+	if (hjstate->hj_CurSkewBucketNo != INVALID_SKEW_BUCKET_NO)
 		hashTuple = hashtable->skewBucket[hjstate->hj_CurSkewBucketNo]->tuples;
 	else
 		hashTuple = hashtable->buckets[hjstate->hj_CurBucketNo];
@@ -1101,17 +1186,21 @@ ExecScanHashBucket(HashJoinState *hjstate,
 			if (ExecQual(hjclauses, econtext, false))
 			{
 				hjstate->hj_CurTuple = hashTuple;
-				return true;
+
+				if (!CheckJoinQualAndPush(hjstate))
+					return false;
+
+				/* if the tuple matched in an anti or semi join, we are done with it */
+				if (hjstate->hj_MatchedOuter &&
+					(jointype == JOIN_ANTI || jointype == JOIN_SEMI))
+					return true;
 			}
 		}
 
 		hashTuple = hashTuple->next;
 	}
 
-	/*
-	 * no match
-	 */
-	return false;
+	return true;
 }
 
 /*
@@ -1135,18 +1224,23 @@ ExecPrepHashTableForUnmatched(HashJoinState *hjstate)
 }
 
 /*
- * ExecScanHashTableForUnmatched
- *		scan the hash table for unmatched inner tuples
+ * ExecScanHashTableForUnmatchedAndPush
+ *		scan the hash table for unmatched inner tuples and push them
  *
- * On success, the inner tuple is stored into hjstate->hj_CurTuple and
- * econtext->ecxt_innertuple, using hjstate->hj_HashTupleSlot as the slot
- * for the latter.
+ * Returns true if the parent still accepts tuples, false otherwise.
  */
 bool
-ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext)
+ExecScanHashTableForUnmatchedAndPush(HashJoinState *hjstate,
+									 ExprContext *econtext)
 {
 	HashJoinTable hashtable = hjstate->hj_HashTable;
-	HashJoinTuple hashTuple = hjstate->hj_CurTuple;
+	HashJoinTuple hashTuple = NULL;
+	bool parent_accepts_tuples = true;
+
+	/*
+	 * For now, we don't support pausing execution and never enter here twice.
+	 */
+	Assert(hjstate->hj_CurTuple == NULL);
 
 	for (;;)
 	{
@@ -1191,18 +1285,25 @@ ExecScanHashTableForUnmatched(HashJoinState *hjstate, ExprContext *econtext)
 				 */
 				ResetExprContext(econtext);
 
+				/*
+				 * Since right now we don't support pausing execution anyway,
+				 * setting hj_CurTuple here is probably unnecessary.
+				 */
 				hjstate->hj_CurTuple = hashTuple;
-				return true;
+
+				/*
+				 * Generate a fake join tuple with nulls for the outer tuple,
+				 * and return it if it passes the non-join quals.
+				 */
+				econtext->ecxt_outertuple = hjstate->hj_NullOuterTupleSlot;
+				parent_accepts_tuples = CheckOtherQualAndPush(hjstate);
 			}
 
 			hashTuple = hashTuple->next;
 		}
 	}
 
-	/*
-	 * no more unmatched tuples
-	 */
-	return false;
+	return parent_accepts_tuples;
 }
 
 /*
diff --git a/src/backend/executor/nodeHashjoin.c b/src/backend/executor/nodeHashjoin.c
index b48863f90b..287d1b5c20 100644
--- a/src/backend/executor/nodeHashjoin.c
+++ b/src/backend/executor/nodeHashjoin.c
@@ -25,353 +25,285 @@
 
 
 /*
- * States of the ExecHashJoin state machine
+ * nodeHashJoin execution states
  */
-#define HJ_BUILD_HASHTABLE		1
-#define HJ_NEED_NEW_OUTER		2
-#define HJ_SCAN_BUCKET			3
-#define HJ_FILL_OUTER_TUPLE		4
-#define HJ_FILL_INNER_TUPLES	5
-#define HJ_NEED_NEW_BATCH		6
-
-/* Returns true if doing null-fill on outer relation */
-#define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
-/* Returns true if doing null-fill on inner relation */
-#define HJ_FILL_INNER(hjstate)	((hjstate)->hj_NullOuterTupleSlot != NULL)
-
-static TupleTableSlot *ExecHashJoinOuterGetTuple(PlanState *outerNode,
-						  HashJoinState *hjstate,
-						  uint32 *hashvalue);
-static TupleTableSlot *ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
-						  BufFile *file,
-						  uint32 *hashvalue,
-						  TupleTableSlot *tupleSlot);
+#define HJ_BUILD_HASHTABLE				1
+#define HJ_SCAN_OUTER					2
+#define HJ_DONE							3
+/* kept ONLY to avoid having to delete the ReScan code; it is not used */
+#define HJ_NEED_NEW_OUTER			   -1
+
+static TupleTableSlot *ExecHashJoinGetSavedTuple(BufFile *file,
+												 uint32 *hashvalue,
+												 TupleTableSlot *tupleSlot);
 static bool ExecHashJoinNewBatch(HashJoinState *hjstate);
+static TupleTableSlot *ExecHashJoinTakeOuterFromTempFile(HashJoinState *hjstate,
+														 uint32 *hashvalue);
+static inline bool ExecHashJoinNewOuter(HashJoinState *hjstate);
+static inline bool ExecHashJoinEndOfBatch(HashJoinState *hjstate);
 
-
-/* ----------------------------------------------------------------
- *		ExecHashJoin
- *
- *		This function implements the Hybrid Hashjoin algorithm.
- *
- *		Note: the relation we build hash table on is the "inner"
- *			  the other one is "outer".
- * ----------------------------------------------------------------
+/*
+ * This function is called from the Hash node with a NULL slot, signaling
+ * that the hashtable is built.
+ * The "extract-one-outer-tuple-to-check-if-it-is-null-before-building-hashtable"
+ * optimization is not implemented for now; the hashtable is always built
+ * first.
  */
-TupleTableSlot *				/* return: a tuple or NULL */
-ExecHashJoin(HashJoinState *node)
+void
+ExecPushNullToHashJoinFromInner(TupleTableSlot *slot, HashJoinState *node)
 {
-	PlanState  *outerNode;
-	HashState  *hashNode;
-	List	   *joinqual;
-	List	   *otherqual;
-	ExprContext *econtext;
 	HashJoinTable hashtable;
-	TupleTableSlot *outerTupleSlot;
-	uint32		hashvalue;
-	int			batchno;
+	HashState *hashNode;
 
-	/*
-	 * get information from HashJoin node
-	 */
-	joinqual = node->js.joinqual;
-	otherqual = node->js.ps.qual;
 	hashNode = (HashState *) innerPlanState(node);
-	outerNode = outerPlanState(node);
-	hashtable = node->hj_HashTable;
-	econtext = node->js.ps.ps_ExprContext;
+
+	/* we should get here only once */
+	Assert(node->hj_JoinState == HJ_BUILD_HASHTABLE);
+	/* we will fish out the tuples from Hash node ourselves */
+	Assert(TupIsNull(slot));
+
+	/* we always build the hashtable first */
+	node->hj_FirstOuterTupleSlot = NULL;
+
+	hashtable = hashNode->hashtable;
+	node->hj_HashTable = hashtable;
 
 	/*
-	 * Reset per-tuple memory context to free any expression evaluation
-	 * storage allocated in the previous tuple cycle.
+	 * need to remember whether nbatch has increased since we
+	 * began scanning the outer relation
 	 */
-	ResetExprContext(econtext);
+	hashtable->nbatch_outstart = hashtable->nbatch;
 
 	/*
-	 * run the hash join state machine
+	 * Reset OuterNotEmpty for scan.
 	 */
-	for (;;)
+	node->hj_OuterNotEmpty = false;
+
+	node->hj_JoinState = HJ_SCAN_OUTER;
+}
+
+/*
+ * Null push from the outer side, so this is the end of the first
+ * batch. Finalize it and handle other batches, taking outer tuples from temp
+ * files.
+ */
+void
+ExecPushNullToHashJoinFromOuter(TupleTableSlot *slot, HashJoinState *node)
+{
+	ExprContext *econtext = node->js.ps.ps_ExprContext;
+	HashJoinTable hashtable = node->hj_HashTable;
+	/* we don't need it; it only exists to match ExecHashGetBucketAndBatch's signature */
+	int batchno;
+	bool parent_accepts_tuples;
+
+	/* We must always be in this state during pushes from the outer side */
+	Assert(node->hj_JoinState == HJ_SCAN_OUTER);
+
+	/* end of the first batch */
+	parent_accepts_tuples = ExecHashJoinEndOfBatch(node);
+
+	/* loop until we run out of batches or the parent stops accepting tuples */
+	while (node->hj_JoinState != HJ_DONE && parent_accepts_tuples)
 	{
-		switch (node->hj_JoinState)
+		slot = ExecHashJoinTakeOuterFromTempFile(node, &node->hj_CurHashValue);
+		if (TupIsNull(slot))
+			/* end of batch, no more outer tuples here */
+			parent_accepts_tuples = ExecHashJoinEndOfBatch(node);
+		else
 		{
-			case HJ_BUILD_HASHTABLE:
-
-				/*
-				 * First time through: build hash table for inner relation.
-				 */
-				Assert(hashtable == NULL);
-
-				/*
-				 * If the outer relation is completely empty, and it's not
-				 * right/full join, we can quit without building the hash
-				 * table.  However, for an inner join it is only a win to
-				 * check this when the outer relation's startup cost is less
-				 * than the projected cost of building the hash table.
-				 * Otherwise it's best to build the hash table first and see
-				 * if the inner relation is empty.  (When it's a left join, we
-				 * should always make this check, since we aren't going to be
-				 * able to skip the join on the strength of an empty inner
-				 * relation anyway.)
-				 *
-				 * If we are rescanning the join, we make use of information
-				 * gained on the previous scan: don't bother to try the
-				 * prefetch if the previous scan found the outer relation
-				 * nonempty. This is not 100% reliable since with new
-				 * parameters the outer relation might yield different
-				 * results, but it's a good heuristic.
-				 *
-				 * The only way to make the check is to try to fetch a tuple
-				 * from the outer plan node.  If we succeed, we have to stash
-				 * it away for later consumption by ExecHashJoinOuterGetTuple.
-				 */
-				if (HJ_FILL_INNER(node))
-				{
-					/* no chance to not build the hash table */
-					node->hj_FirstOuterTupleSlot = NULL;
-				}
-				else if (HJ_FILL_OUTER(node) ||
-						 (outerNode->plan->startup_cost < hashNode->ps.plan->total_cost &&
-						  !node->hj_OuterNotEmpty))
-				{
-					node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode);
-					if (TupIsNull(node->hj_FirstOuterTupleSlot))
-					{
-						node->hj_OuterNotEmpty = false;
-						return NULL;
-					}
-					else
-						node->hj_OuterNotEmpty = true;
-				}
-				else
-					node->hj_FirstOuterTupleSlot = NULL;
-
-				/*
-				 * create the hash table
-				 */
-				hashtable = ExecHashTableCreate((Hash *) hashNode->ps.plan,
-												node->hj_HashOperators,
-												HJ_FILL_INNER(node));
-				node->hj_HashTable = hashtable;
-
-				/*
-				 * execute the Hash node, to build the hash table
-				 */
-				hashNode->hashtable = hashtable;
-				(void) MultiExecProcNode((PlanState *) hashNode);
-
-				/*
-				 * If the inner relation is completely empty, and we're not
-				 * doing a left outer join, we can quit without scanning the
-				 * outer relation.
-				 */
-				if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
-					return NULL;
-
-				/*
-				 * need to remember whether nbatch has increased since we
-				 * began scanning the outer relation
-				 */
-				hashtable->nbatch_outstart = hashtable->nbatch;
-
-				/*
-				 * Reset OuterNotEmpty for scan.  (It's OK if we fetched a
-				 * tuple above, because ExecHashJoinOuterGetTuple will
-				 * immediately set it again.)
-				 */
-				node->hj_OuterNotEmpty = false;
-
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
-
-				/* FALL THRU */
-
-			case HJ_NEED_NEW_OUTER:
-
-				/*
-				 * We don't have an outer tuple, try to get the next one
-				 */
-				outerTupleSlot = ExecHashJoinOuterGetTuple(outerNode,
-														   node,
-														   &hashvalue);
-				if (TupIsNull(outerTupleSlot))
-				{
-					/* end of batch, or maybe whole join */
-					if (HJ_FILL_INNER(node))
-					{
-						/* set up to scan for unmatched inner tuples */
-						ExecPrepHashTableForUnmatched(node);
-						node->hj_JoinState = HJ_FILL_INNER_TUPLES;
-					}
-					else
-						node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					continue;
-				}
-
-				econtext->ecxt_outertuple = outerTupleSlot;
-				node->hj_MatchedOuter = false;
-
-				/*
-				 * Find the corresponding bucket for this tuple in the main
-				 * hash table or skew hash table.
-				 */
-				node->hj_CurHashValue = hashvalue;
-				ExecHashGetBucketAndBatch(hashtable, hashvalue,
-										  &node->hj_CurBucketNo, &batchno);
-				node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable,
-																 hashvalue);
-				node->hj_CurTuple = NULL;
-
-				/*
-				 * The tuple might not belong to the current batch (where
-				 * "current batch" includes the skew buckets if any).
-				 */
-				if (batchno != hashtable->curbatch &&
-					node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
-				{
-					/*
-					 * Need to postpone this outer tuple to a later batch.
-					 * Save it in the corresponding outer-batch file.
-					 */
-					Assert(batchno > hashtable->curbatch);
-					ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(outerTupleSlot),
-										  hashvalue,
-										&hashtable->outerBatchFile[batchno]);
-					/* Loop around, staying in HJ_NEED_NEW_OUTER state */
-					continue;
-				}
-
-				/* OK, let's scan the bucket for matches */
-				node->hj_JoinState = HJ_SCAN_BUCKET;
-
-				/* FALL THRU */
-
-			case HJ_SCAN_BUCKET:
-
-				/*
-				 * We check for interrupts here because this corresponds to
-				 * where we'd fetch a row from a child plan node in other join
-				 * types.
-				 */
-				CHECK_FOR_INTERRUPTS();
-
-				/*
-				 * Scan the selected hash bucket for matches to current outer
-				 */
-				if (!ExecScanHashBucket(node, econtext))
-				{
-					/* out of matches; check for possible outer-join fill */
-					node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
-					continue;
-				}
-
-				/*
-				 * We've got a match, but still need to test non-hashed quals.
-				 * ExecScanHashBucket already set up all the state needed to
-				 * call ExecQual.
-				 *
-				 * If we pass the qual, then save state for next call and have
-				 * ExecProject form the projection, store it in the tuple
-				 * table, and return the slot.
-				 *
-				 * Only the joinquals determine tuple match status, but all
-				 * quals must pass to actually return the tuple.
-				 */
-				if (joinqual == NIL || ExecQual(joinqual, econtext, false))
-				{
-					node->hj_MatchedOuter = true;
-					HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
-
-					/* In an antijoin, we never return a matched tuple */
-					if (node->js.jointype == JOIN_ANTI)
-					{
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
-						continue;
-					}
-
-					/*
-					 * In a semijoin, we'll consider returning the first
-					 * match, but after that we're done with this outer tuple.
-					 */
-					if (node->js.jointype == JOIN_SEMI)
-						node->hj_JoinState = HJ_NEED_NEW_OUTER;
-
-					if (otherqual == NIL ||
-						ExecQual(otherqual, econtext, false))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
-				}
-				else
-					InstrCountFiltered1(node, 1);
-				break;
-
-			case HJ_FILL_OUTER_TUPLE:
-
-				/*
-				 * The current outer tuple has run out of matches, so check
-				 * whether to emit a dummy outer-join tuple.  Whether we emit
-				 * one or not, the next state is NEED_NEW_OUTER.
-				 */
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
-
-				if (!node->hj_MatchedOuter &&
-					HJ_FILL_OUTER(node))
-				{
-					/*
-					 * Generate a fake join tuple with nulls for the inner
-					 * tuple, and return it if it passes the non-join quals.
-					 */
-					econtext->ecxt_innertuple = node->hj_NullInnerTupleSlot;
-
-					if (otherqual == NIL ||
-						ExecQual(otherqual, econtext, false))
-						return ExecProject(node->js.ps.ps_ProjInfo);
-					else
-						InstrCountFiltered2(node, 1);
-				}
-				break;
-
-			case HJ_FILL_INNER_TUPLES:
-
-				/*
-				 * We have finished a batch, but we are doing right/full join,
-				 * so any unmatched inner tuples in the hashtable have to be
-				 * emitted before we continue to the next batch.
-				 */
-				if (!ExecScanHashTableForUnmatched(node, econtext))
-				{
-					/* no more unmatched tuples */
-					node->hj_JoinState = HJ_NEED_NEW_BATCH;
-					continue;
-				}
-
-				/*
-				 * Generate a fake join tuple with nulls for the outer tuple,
-				 * and return it if it passes the non-join quals.
-				 */
-				econtext->ecxt_outertuple = node->hj_NullOuterTupleSlot;
-
-				if (otherqual == NIL ||
-					ExecQual(otherqual, econtext, false))
-					return ExecProject(node->js.ps.ps_ProjInfo);
-				else
-					InstrCountFiltered2(node, 1);
-				break;
-
-			case HJ_NEED_NEW_BATCH:
-
-				/*
-				 * Try to advance to next batch.  Done if there are no more.
-				 */
-				if (!ExecHashJoinNewBatch(node))
-					return NULL;	/* end of join */
-				node->hj_JoinState = HJ_NEED_NEW_OUTER;
-				break;
-
-			default:
-				elog(ERROR, "unrecognized hashjoin state: %d",
-					 (int) node->hj_JoinState);
+			econtext->ecxt_outertuple = slot;
+			/*
+			 * Find the corresponding bucket for this tuple in the main
+			 * hash table or skew hash table.
+			 */
+			ExecHashGetBucketAndBatch(hashtable, node->hj_CurHashValue,
+									  &node->hj_CurBucketNo, &batchno);
+			node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable,
+															 node->hj_CurHashValue);
+			parent_accepts_tuples = ExecHashJoinNewOuter(node);
 		}
+
+	}
+}
+
+/*
+ * Non-null push from the outer side. Finds matches and sends them upward to
+ * HashJoin's parent. Returns true if the parent still waits for tuples, false
+ * otherwise. When this function is called, the hashtable must already be
+ * filled.
+ */
+bool
+ExecPushTupleToHashJoinFromOuter(TupleTableSlot *slot,
+								 HashJoinState *node)
+{
+	ExprContext *econtext = node->js.ps.ps_ExprContext;
+	HashJoinTable hashtable = node->hj_HashTable;
+	int			batchno;
+
+	/* We must always be in this state during pushes from the outer side */
+	Assert(node->hj_JoinState == HJ_SCAN_OUTER);
+
+	/*
+	 * We have to compute the tuple's hash value.
+	 */
+	econtext->ecxt_outertuple = slot;
+	if (!ExecHashGetHashValue(hashtable, econtext,
+							  node->hj_OuterHashKeys,
+							  true,		/* outer tuple */
+							  HJ_FILL_OUTER(node),
+							  &node->hj_CurHashValue))
+	{
+		/*
+		 * That tuple couldn't match because of a NULL hashed attr, so discard
+		 * it and wait for the next one.
+		 */
+		return true;
+	}
+
+	/*
+	 * Find the corresponding bucket for this tuple in the main
+	 * hash table or skew hash table.
+	 */
+	ExecHashGetBucketAndBatch(hashtable, node->hj_CurHashValue,
+							  &node->hj_CurBucketNo, &batchno);
+	node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable,
+													 node->hj_CurHashValue);
+
+	/*
+	 * The tuple might not belong to the current batch, which is 0 (it is
+	 * always 0 while we are receiving non-null tuples from below; the
+	 * "current batch" also includes the skew buckets, if any).
+	 */
+	if (batchno != 0 && node->hj_CurSkewBucketNo == INVALID_SKEW_BUCKET_NO)
+	{
+		/*
+		 * Need to postpone this outer tuple to a later batch.
+		 * Save it in the corresponding outer-batch file.
+		 */
+		Assert(batchno > hashtable->curbatch);
+		ExecHashJoinSaveTuple(ExecFetchSlotMinimalTuple(slot),
+							  node->hj_CurHashValue,
+							  &hashtable->outerBatchFile[batchno]);
+		/* wait for the next tuple */
+		return true;
+	}
+
+	/*
+	 * OK, now we have a non-null outer tuple that belongs to the current
+	 * batch; time to search for matches.
+	 */
+	return ExecHashJoinNewOuter(node);
+}
+
+/*
+ * Called when we process a non-null tuple from the outer side. Finds matches
+ * for it and pushes them. Pushes a dummy outer-join tuple if no matches were
+ * found.
+ *
+ * The outer tuple must be stored in
+ * hjstate->js.ps.ps_ExprContext->ecxt_outertuple. Besides, the bucket to scan
+ * must be stored in hjstate->hj_CurBucketNo and hjstate->hj_CurSkewBucketNo;
+ * we could calculate them right in this function, but then we would have to
+ * add a batchno != hashtable->curbatch check here, which is not needed when
+ * batchno > 0.
+ *
+ * Returns true if the parent still waits for tuples, false otherwise.
+ */
+static inline bool ExecHashJoinNewOuter(HashJoinState *hjstate)
+{
+	ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
+	JoinType	jointype = hjstate->js.jointype;
+
+	/* not sure we should do it here */
+	CHECK_FOR_INTERRUPTS();
+
+	hjstate->hj_CurTuple = NULL;
+	hjstate->hj_MatchedOuter = false;
+
+	/*
+	 * Push all matching tuples from selected hash bucket
+	 */
+	if (!ExecScanHashBucketAndPush(hjstate, econtext))
+		return false;
+
+	/* if the tuple matched in an anti or semi join, we are done with it */
+	if (hjstate->hj_MatchedOuter &&
+		(jointype == JOIN_ANTI || jointype == JOIN_SEMI))
+		return true;
+
+	if (!hjstate->hj_MatchedOuter && HJ_FILL_OUTER(hjstate))
+	{
+		/*
+		 * Generate a fake join tuple with nulls for the inner
+		 * tuple, and push it if it passes the non-join quals.
+		 */
+		econtext->ecxt_innertuple = hjstate->hj_NullInnerTupleSlot;
+
+		return CheckOtherQualAndPush(hjstate);
 	}
+
+	return true;
+}
+
+/*
+ * Called when we have finished the batch: push unmatched inner tuples, if we
+ * are filling the inner side, and advance the batch. Returns true if the
+ * parent still waits for tuples, false otherwise. Sets hjstate->hj_JoinState
+ * to HJ_DONE if there are no more batches: this is the end of the join. It
+ * signals the parent about the latter by pushing NULL.
+ */
+static inline bool ExecHashJoinEndOfBatch(HashJoinState *hjstate)
+{
+	ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
+
+	if (HJ_FILL_INNER(hjstate))
+	{
+		/*
+		 * We are doing right/full join,
+		 * so any unmatched inner tuples in the hashtable have to be
+		 * emitted before we continue to the next batch.
+		 */
+
+		/* set up to scan for unmatched inner tuples */
+		ExecPrepHashTableForUnmatched(hjstate);
+		if (!ExecScanHashTableForUnmatchedAndPush(hjstate, econtext))
+			return false;
+	}
+
+	/* advance the batch TODO */
+	if (!ExecHashJoinNewBatch(hjstate))
+	{
+		hjstate->hj_JoinState = HJ_DONE;	/* end of join */
+		/* let parent know that we are done */
+		ExecPushNull(NULL, (PlanState *) hjstate);
+	}
+
+	return true;
+}
+
+/*
+ * Get the next outer tuple from the saved temp files. If we are here, we are
+ * not processing the first batch. On success, the tuple's hash value,
+ * re-read from the temp file, is stored at *hashvalue.
+ * Returns NULL at the end of the batch, a tuple otherwise.
+ */
+static TupleTableSlot *ExecHashJoinTakeOuterFromTempFile(HashJoinState *hjstate,
+														 uint32 *hashvalue)
+{
+	HashJoinTable hashtable = hjstate->hj_HashTable;
+	int			curbatch = hashtable->curbatch;
+	BufFile    *file = hashtable->outerBatchFile[curbatch];
+
+	/*
+	 * In outer-join cases, we could get here even though the batch file
+	 * is empty.
+	 */
+	if (file == NULL)
+		return NULL;
+
+	return ExecHashJoinGetSavedTuple(file,
+									 hashvalue,
+									 hjstate->hj_OuterTupleSlot);
 }
 
 /* ----------------------------------------------------------------
@@ -381,7 +313,7 @@ ExecHashJoin(HashJoinState *node)
  * ----------------------------------------------------------------
  */
 HashJoinState *
-ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
+ExecInitHashJoin(HashJoin *node, EState *estate, int eflags, PlanState *parent)
 {
 	HashJoinState *hjstate;
 	Plan	   *outerNode;
@@ -400,6 +332,7 @@ ExecInitHashJoin(HashJoin *node, EState *estate, int eflags)
 	hjstate = makeNode(HashJoinState);
 	hjstate->js.ps.plan = (Plan *) node;
 	hjstate->js.ps.state = estate;
+	hjstate->js.ps.parent = parent;
 
 	/*
 	 * Miscellaneous initialization
@@ -579,89 +512,6 @@ ExecEndHashJoin(HashJoinState *node)
 }
 
 /*
- * ExecHashJoinOuterGetTuple
- *
- *		get the next outer tuple for hashjoin: either by
- *		executing the outer plan node in the first pass, or from
- *		the temp files for the hashjoin batches.
- *
- * Returns a null slot if no more outer tuples (within the current batch).
- *
- * On success, the tuple's hash value is stored at *hashvalue --- this is
- * either originally computed, or re-read from the temp file.
- */
-static TupleTableSlot *
-ExecHashJoinOuterGetTuple(PlanState *outerNode,
-						  HashJoinState *hjstate,
-						  uint32 *hashvalue)
-{
-	HashJoinTable hashtable = hjstate->hj_HashTable;
-	int			curbatch = hashtable->curbatch;
-	TupleTableSlot *slot;
-
-	if (curbatch == 0)			/* if it is the first pass */
-	{
-		/*
-		 * Check to see if first outer tuple was already fetched by
-		 * ExecHashJoin() and not used yet.
-		 */
-		slot = hjstate->hj_FirstOuterTupleSlot;
-		if (!TupIsNull(slot))
-			hjstate->hj_FirstOuterTupleSlot = NULL;
-		else
-			slot = ExecProcNode(outerNode);
-
-		while (!TupIsNull(slot))
-		{
-			/*
-			 * We have to compute the tuple's hash value.
-			 */
-			ExprContext *econtext = hjstate->js.ps.ps_ExprContext;
-
-			econtext->ecxt_outertuple = slot;
-			if (ExecHashGetHashValue(hashtable, econtext,
-									 hjstate->hj_OuterHashKeys,
-									 true,		/* outer tuple */
-									 HJ_FILL_OUTER(hjstate),
-									 hashvalue))
-			{
-				/* remember outer relation is not empty for possible rescan */
-				hjstate->hj_OuterNotEmpty = true;
-
-				return slot;
-			}
-
-			/*
-			 * That tuple couldn't match because of a NULL, so discard it and
-			 * continue with the next one.
-			 */
-			slot = ExecProcNode(outerNode);
-		}
-	}
-	else if (curbatch < hashtable->nbatch)
-	{
-		BufFile    *file = hashtable->outerBatchFile[curbatch];
-
-		/*
-		 * In outer-join cases, we could get here even though the batch file
-		 * is empty.
-		 */
-		if (file == NULL)
-			return NULL;
-
-		slot = ExecHashJoinGetSavedTuple(hjstate,
-										 file,
-										 hashvalue,
-										 hjstate->hj_OuterTupleSlot);
-		if (!TupIsNull(slot))
-			return slot;
-	}
-
-	/* End of this batch */
-	return NULL;
-}
-
-/*
  * ExecHashJoinNewBatch
  *		switch to a new hashjoin batch
  *
@@ -769,8 +619,7 @@ ExecHashJoinNewBatch(HashJoinState *hjstate)
 					(errcode_for_file_access(),
 				   errmsg("could not rewind hash-join temporary file: %m")));
 
-		while ((slot = ExecHashJoinGetSavedTuple(hjstate,
-												 innerFile,
+		while ((slot = ExecHashJoinGetSavedTuple(innerFile,
 												 &hashvalue,
 												 hjstate->hj_HashTupleSlot)))
 		{
@@ -849,8 +698,7 @@ ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
  * itself is stored in the given slot.
  */
 static TupleTableSlot *
-ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
-						  BufFile *file,
+ExecHashJoinGetSavedTuple(BufFile *file,
 						  uint32 *hashvalue,
 						  TupleTableSlot *tupleSlot)
 {
@@ -893,7 +741,6 @@ ExecHashJoinGetSavedTuple(HashJoinState *hjstate,
 	return ExecStoreMinimalTuple(tuple, tupleSlot, true);
 }
 
-
 void
 ExecReScanHashJoin(HashJoinState *node)
 {
diff --git a/src/include/executor/nodeHash.h b/src/include/executor/nodeHash.h
index fe5c2642d7..180dea991f 100644
--- a/src/include/executor/nodeHash.h
+++ b/src/include/executor/nodeHash.h
@@ -16,8 +16,10 @@
 
 #include "nodes/execnodes.h"
 
-extern HashState *ExecInitHash(Hash *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHash(HashState *node);
+extern HashState *ExecInitHash(Hash *node, EState *estate, int eflags,
+							   PlanState *parent);
+extern bool ExecPushTupleToHash(TupleTableSlot *slot, HashState *node);
+extern void ExecPushNullToHash(TupleTableSlot *slot, HashState *node);
 extern Node *MultiExecHash(HashState *node);
 extern void ExecEndHash(HashState *node);
 extern void ExecReScanHash(HashState *node);
@@ -38,9 +40,10 @@ extern void ExecHashGetBucketAndBatch(HashJoinTable hashtable,
 						  uint32 hashvalue,
 						  int *bucketno,
 						  int *batchno);
-extern bool ExecScanHashBucket(HashJoinState *hjstate, ExprContext *econtext);
+extern bool ExecScanHashBucketAndPush(HashJoinState *hjstate,
+									  ExprContext *econtext);
 extern void ExecPrepHashTableForUnmatched(HashJoinState *hjstate);
-extern bool ExecScanHashTableForUnmatched(HashJoinState *hjstate,
+extern bool ExecScanHashTableForUnmatchedAndPush(HashJoinState *hjstate,
 							  ExprContext *econtext);
 extern void ExecHashTableReset(HashJoinTable hashtable);
 extern void ExecHashTableResetMatchFlags(HashJoinTable hashtable);
diff --git a/src/include/executor/nodeHashjoin.h b/src/include/executor/nodeHashjoin.h
index ddc32b1de3..817ad2259f 100644
--- a/src/include/executor/nodeHashjoin.h
+++ b/src/include/executor/nodeHashjoin.h
@@ -16,13 +16,93 @@
 
 #include "nodes/execnodes.h"
 #include "storage/buffile.h"
+#include "executor/executor.h"
+#include "executor/hashjoin.h"
+#include "access/htup_details.h"
+#include "utils/memutils.h"
 
-extern HashJoinState *ExecInitHashJoin(HashJoin *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecHashJoin(HashJoinState *node);
+/* Returns true if doing null-fill on outer relation */
+#define HJ_FILL_OUTER(hjstate)	((hjstate)->hj_NullInnerTupleSlot != NULL)
+/* Returns true if doing null-fill on inner relation */
+#define HJ_FILL_INNER(hjstate)	((hjstate)->hj_NullOuterTupleSlot != NULL)
+
+extern HashJoinState *ExecInitHashJoin(HashJoin *node, EState *estate,
+									   int eflags, PlanState *parent);
+extern void ExecPushNullToHashJoinFromInner(TupleTableSlot *slot,
+											HashJoinState *hjstate);
+extern void ExecPushNullToHashJoinFromOuter(TupleTableSlot *slot,
+											HashJoinState *hjstate);
+extern bool ExecPushTupleToHashJoinFromOuter(TupleTableSlot *slot,
+											 HashJoinState *hjstate);
+extern bool pushTupleToHashJoinFromInner(TupleTableSlot *slot,
+								  HashJoinState *node);
+extern bool pushTupleToHashJoinFromOuter(TupleTableSlot *slot,
+										 HashJoinState *node);
 extern void ExecEndHashJoin(HashJoinState *node);
 extern void ExecReScanHashJoin(HashJoinState *node);
 
 extern void ExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,
 					  BufFile **fileptr);
 
-#endif   /* NODEHASHJOIN_H */
+/* inline function declarations and implementations */
+static inline bool CheckOtherQualAndPush(HashJoinState *node);
+static inline bool CheckJoinQualAndPush(HashJoinState *node);
+
+/*
+ * Everything is ready for checking otherqual and projecting; do that,
+ * and push the result.
+ *
+ * Returns true if the parent accepts more tuples, false otherwise.
+ */
+static inline bool CheckOtherQualAndPush(HashJoinState *node)
+{
+	ExprContext *econtext = node->js.ps.ps_ExprContext;
+	List *otherqual = node->js.ps.qual;
+	TupleTableSlot *slot;
+
+	if (otherqual == NIL ||
+		ExecQual(otherqual, econtext, false))
+	{
+		slot = ExecProject(node->js.ps.ps_ProjInfo);
+		return ExecPushTuple(slot, (PlanState *) node);
+	}
+	else
+		InstrCountFiltered2(node, 1);
+	return true;
+}
+
+/*
+ * We have found an inner tuple whose hashed quals match the current outer
+ * tuple. Now check the non-hashed join quals and the other quals, then
+ * project and push the result.
+ *
+ * The state for ExecQual was already set up by ExecScanHashBucketAndPush
+ * and earlier. Returns true if the parent accepts more tuples, false
+ * otherwise.
+ */
+static inline bool CheckJoinQualAndPush(HashJoinState *node)
+{
+	List	   *joinqual = node->js.joinqual;
+	ExprContext *econtext = node->js.ps.ps_ExprContext;
+
+	/*
+	 * Only the joinquals determine tuple match status, but all
+	 * quals must pass to actually return the tuple.
+	 */
+	if (joinqual == NIL || ExecQual(joinqual, econtext, false))
+	{
+		node->hj_MatchedOuter = true;
+		HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
+
+		/* In an antijoin, we never return a matched tuple */
+		if (node->js.jointype == JOIN_ANTI)
+			return true;
+
+		return CheckOtherQualAndPush(node);
+	}
+	else
+		InstrCountFiltered1(node, 1);
+
+	return true;
+}
+
+#endif	 /* NODEHASHJOIN_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index da7fd9c7ac..abbe67ba0c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2145,6 +2145,8 @@ typedef struct HashState
 	HashJoinTable hashtable;	/* hash table for the hashjoin */
 	List	   *hashkeys;		/* list of ExprState nodes */
 	/* hashkeys is same as parent's hj_InnerHashKeys */
+	/* on the first push we must build the hashtable */
+	bool first_time_through;
 } HashState;
 
 /* ----------------
-- 
2.11.0

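Before the Limit patch, it is worth spelling out the single convention all of these patches share: a node's push entry point returns false as soon as it no longer wants tuples, and every producer loop checks that result and stops early. The following is a hypothetical, self-contained C sketch of that contract; the names (`Consumer`, `push_tuple`, `run_producer`) are illustrative only and are not the actual executor API.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct Consumer
{
	int		seen;			/* tuples accepted so far */
	int		want;			/* stop accepting after this many */
} Consumer;

/* Consumer side: returns true while it still accepts tuples. */
bool
push_tuple(Consumer *c, int tuple)
{
	(void) tuple;			/* a real node would process the tuple here */
	c->seen++;
	return c->seen < c->want;
}

/* Producer side: pushes rows 0..nrows-1, honoring back-pressure. */
int
run_producer(Consumer *c, int nrows)
{
	int		pushed = 0;
	int		i;

	for (i = 0; i < nrows; i++)
	{
		pushed++;
		if (!push_tuple(c, i))
			break;			/* parent is done; stop producing early */
	}
	return pushed;
}
```

In the patches the same convention appears as the bool result of ExecPushTuple and the per-node ExecPushTupleTo* functions: a false return propagates down the pipeline and terminates the producing loop, which is how LIMIT-style early termination works without a pull interface.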
0006-Reversed-Limit-implementation.patch
From 34a5807fca043023c48fb4810710ab4ed42ce728 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Sat, 11 Mar 2017 02:33:47 +0300
Subject: [PATCH 6/8] Reversed Limit implementation.

---
 src/backend/executor/execProcnode.c |  18 ++-
 src/backend/executor/nodeLimit.c    | 244 ++++++++----------------------------
 src/include/executor/nodeLimit.h    |   6 +-
 src/include/nodes/execnodes.h       |  10 +-
 4 files changed, 78 insertions(+), 200 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f2275876d6..1ebb0da36f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -197,6 +197,12 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 			result = (PlanState *) ExecInitHash((Hash *) node,
 												estate, eflags, parent);
 			break;
+
+		case T_Limit:
+			result = (PlanState *) ExecInitLimit((Limit *) node,
+												 estate, eflags, parent);
+			break;
+
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
@@ -274,7 +280,9 @@ ExecPushTuple(TupleTableSlot *slot, PlanState *pusher)
 		return SendReadyTuple(slot, pusher);
 	}
 
-	if (nodeTag(receiver) == T_HashState)
+	if (nodeTag(receiver) == T_LimitState)
+		return ExecPushTupleToLimit(slot, (LimitState *) receiver);
+	else if (nodeTag(receiver) == T_HashState)
 		return ExecPushTupleToHash(slot, (HashState *) receiver);
 
 	/* does push come from the outer side? */
@@ -314,7 +322,9 @@ ExecPushNull(TupleTableSlot *slot, PlanState *pusher)
 		return;
 	}
 
-	if (nodeTag(receiver) == T_HashState)
+	if (nodeTag(receiver) == T_LimitState)
+		return ExecPushNullToLimit(slot, (LimitState *) receiver);
+	else if (nodeTag(receiver) == T_HashState)
 		return ExecPushNullToHash(slot, (HashState *) receiver);
 
 	/* does push come from the outer side? */
@@ -391,6 +401,10 @@ ExecEndNode(PlanState *node)
 			ExecEndHash((HashState *) node);
 			break;
 
+		case T_LimitState:
+			ExecEndLimit((LimitState *) node);
+			break;
+
 		default:
 			elog(ERROR, "unrecognized/unsupported node type: %d",
 				 (int) nodeTag(node));
diff --git a/src/backend/executor/nodeLimit.c b/src/backend/executor/nodeLimit.c
index bcacbfc13b..ad3b6b436c 100644
--- a/src/backend/executor/nodeLimit.c
+++ b/src/backend/executor/nodeLimit.c
@@ -28,199 +28,66 @@
 static void recompute_limits(LimitState *node);
 static void pass_down_bound(LimitState *node, PlanState *child_node);
 
-
-/* ----------------------------------------------------------------
- *		ExecLimit
- *
- *		This is a very simple node which just performs LIMIT/OFFSET
- *		filtering on the stream of tuples returned by a subplan.
- * ----------------------------------------------------------------
- */
-TupleTableSlot *				/* return: a tuple or NULL */
-ExecLimit(LimitState *node)
+bool
+ExecPushTupleToLimit(TupleTableSlot *slot, LimitState *node)
 {
-	ScanDirection direction;
-	TupleTableSlot *slot;
-	PlanState  *outerPlan;
+	bool parent_accepts_tuples;
+	bool limit_accepts_tuples;
+	/* last tuple in the window just pushed */
+	bool last_tuple_pushed;
 
 	/*
-	 * get information from the node
+	 * Backward direction is not supported at the moment
 	 */
-	direction = node->ps.state->es_direction;
-	outerPlan = outerPlanState(node);
+	Assert(ScanDirectionIsForward(node->ps.state->es_direction));
+	/* guard against calling ExecPushTupleToLimit after it returned false */
+	Assert(node->lstate != LIMIT_DONE);
 
-	/*
-	 * The main logic is a simple state machine.
-	 */
-	switch (node->lstate)
+	if (node->lstate == LIMIT_INITIAL)
 	{
-		case LIMIT_INITIAL:
-
-			/*
-			 * First call for this node, so compute limit/offset. (We can't do
-			 * this any earlier, because parameters from upper nodes will not
-			 * be set during ExecInitLimit.)  This also sets position = 0 and
-			 * changes the state to LIMIT_RESCAN.
-			 */
-			recompute_limits(node);
-
-			/* FALL THRU */
-
-		case LIMIT_RESCAN:
-
-			/*
-			 * If backwards scan, just return NULL without changing state.
-			 */
-			if (!ScanDirectionIsForward(direction))
-				return NULL;
-
-			/*
-			 * Check for empty window; if so, treat like empty subplan.
-			 */
-			if (node->count <= 0 && !node->noCount)
-			{
-				node->lstate = LIMIT_EMPTY;
-				return NULL;
-			}
-
-			/*
-			 * Fetch rows from subplan until we reach position > offset.
-			 */
-			for (;;)
-			{
-				slot = ExecProcNode(outerPlan);
-				if (TupIsNull(slot))
-				{
-					/*
-					 * The subplan returns too few tuples for us to produce
-					 * any output at all.
-					 */
-					node->lstate = LIMIT_EMPTY;
-					return NULL;
-				}
-				node->subSlot = slot;
-				if (++node->position > node->offset)
-					break;
-			}
-
-			/*
-			 * Okay, we have the first tuple of the window.
-			 */
-			node->lstate = LIMIT_INWINDOW;
-			break;
-
-		case LIMIT_EMPTY:
-
-			/*
-			 * The subplan is known to return no tuples (or not more than
-			 * OFFSET tuples, in general).  So we return no tuples.
-			 */
-			return NULL;
-
-		case LIMIT_INWINDOW:
-			if (ScanDirectionIsForward(direction))
-			{
-				/*
-				 * Forwards scan, so check for stepping off end of window. If
-				 * we are at the end of the window, return NULL without
-				 * advancing the subplan or the position variable; but change
-				 * the state machine state to record having done so.
-				 */
-				if (!node->noCount &&
-					node->position - node->offset >= node->count)
-				{
-					node->lstate = LIMIT_WINDOWEND;
-					return NULL;
-				}
-
-				/*
-				 * Get next tuple from subplan, if any.
-				 */
-				slot = ExecProcNode(outerPlan);
-				if (TupIsNull(slot))
-				{
-					node->lstate = LIMIT_SUBPLANEOF;
-					return NULL;
-				}
-				node->subSlot = slot;
-				node->position++;
-			}
-			else
-			{
-				/*
-				 * Backwards scan, so check for stepping off start of window.
-				 * As above, change only state-machine status if so.
-				 */
-				if (node->position <= node->offset + 1)
-				{
-					node->lstate = LIMIT_WINDOWSTART;
-					return NULL;
-				}
-
-				/*
-				 * Get previous tuple from subplan; there should be one!
-				 */
-				slot = ExecProcNode(outerPlan);
-				if (TupIsNull(slot))
-					elog(ERROR, "LIMIT subplan failed to run backwards");
-				node->subSlot = slot;
-				node->position--;
-			}
-			break;
-
-		case LIMIT_SUBPLANEOF:
-			if (ScanDirectionIsForward(direction))
-				return NULL;
-
-			/*
-			 * Backing up from subplan EOF, so re-fetch previous tuple; there
-			 * should be one!  Note previous tuple must be in window.
-			 */
-			slot = ExecProcNode(outerPlan);
-			if (TupIsNull(slot))
-				elog(ERROR, "LIMIT subplan failed to run backwards");
-			node->subSlot = slot;
-			node->lstate = LIMIT_INWINDOW;
-			/* position does not change 'cause we didn't advance it before */
-			break;
-
-		case LIMIT_WINDOWEND:
-			if (ScanDirectionIsForward(direction))
-				return NULL;
-
-			/*
-			 * Backing up from window end: simply re-return the last tuple
-			 * fetched from the subplan.
-			 */
-			slot = node->subSlot;
-			node->lstate = LIMIT_INWINDOW;
-			/* position does not change 'cause we didn't advance it before */
-			break;
-
-		case LIMIT_WINDOWSTART:
-			if (!ScanDirectionIsForward(direction))
-				return NULL;
-
-			/*
-			 * Advancing after having backed off window start: simply
-			 * re-return the last tuple fetched from the subplan.
-			 */
-			slot = node->subSlot;
-			node->lstate = LIMIT_INWINDOW;
-			/* position does not change 'cause we didn't change it before */
-			break;
-
-		default:
-			elog(ERROR, "impossible LIMIT state: %d",
-				 (int) node->lstate);
-			slot = NULL;		/* keep compiler quiet */
-			break;
+		/*
+		 * First call for this node, so compute limit/offset. (We can't do
+		 * this any earlier, because parameters from upper nodes will not
+		 * be set during ExecInitLimit.) This also sets position = 0.
+		 */
+		recompute_limits(node);
+
+		/*
+		 * Check for empty window; if so, treat like empty subplan.
+		 */
+		if (!node->noCount && node->count <= 0)
+		{
+			node->lstate = LIMIT_DONE;
+			ExecPushNull(NULL, (PlanState *) node);
+			return false;
+		}
+
+		node->lstate = LIMIT_ACTIVE;
 	}
 
-	/* Return the current tuple */
-	Assert(!TupIsNull(slot));
+	if (++node->position <= node->offset)
+	{
+		/* we are not inside the window yet, wait for the next tuple */
+		return true;
+	}
+	/*
+	 * Now we know we are inside the window, so this tuple must be pushed.
+	 */
+	parent_accepts_tuples = ExecPushTuple(slot, (PlanState *) node);
+	/* Did we just push the last tuple of the window? */
+	last_tuple_pushed = !node->noCount &&
+		node->position - node->offset >= node->count;
+	limit_accepts_tuples = parent_accepts_tuples && !last_tuple_pushed;
+	if (!limit_accepts_tuples)
+		node->lstate = LIMIT_DONE;
+	return limit_accepts_tuples;
+}
 
-	return slot;
+/* NULL came from below, so this LIMIT is done anyway */
+void
+ExecPushNullToLimit(TupleTableSlot *slot, LimitState *node)
+{
+	node->lstate = LIMIT_DONE;
+	ExecPushNull(slot, (PlanState *) node);
 }
 
 /*
@@ -290,9 +157,6 @@ recompute_limits(LimitState *node)
 	node->position = 0;
 	node->subSlot = NULL;
 
-	/* Set state-machine state */
-	node->lstate = LIMIT_RESCAN;
-
 	/* Notify child node about limit, if useful */
 	pass_down_bound(node, outerPlanState(node));
 }
@@ -361,7 +225,7 @@ pass_down_bound(LimitState *node, PlanState *child_node)
  * ----------------------------------------------------------------
  */
 LimitState *
-ExecInitLimit(Limit *node, EState *estate, int eflags)
+ExecInitLimit(Limit *node, EState *estate, int eflags, PlanState *parent)
 {
 	LimitState *limitstate;
 	Plan	   *outerPlan;
@@ -375,6 +239,7 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	limitstate = makeNode(LimitState);
 	limitstate->ps.plan = (Plan *) node;
 	limitstate->ps.state = estate;
+	limitstate->ps.parent = parent;
 
 	limitstate->lstate = LIMIT_INITIAL;
 
@@ -403,7 +268,8 @@ ExecInitLimit(Limit *node, EState *estate, int eflags)
 	 * then initialize outer plan
 	 */
 	outerPlan = outerPlan(node);
-	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
+	outerPlanState(limitstate) = ExecInitNode(outerPlan, estate, eflags,
+											  (PlanState *) limitstate);
 
 	/*
 	 * limit nodes do no projections, so initialize projection info for this
diff --git a/src/include/executor/nodeLimit.h b/src/include/executor/nodeLimit.h
index 6e4084b46d..ceefca0bf2 100644
--- a/src/include/executor/nodeLimit.h
+++ b/src/include/executor/nodeLimit.h
@@ -16,8 +16,10 @@
 
 #include "nodes/execnodes.h"
 
-extern LimitState *ExecInitLimit(Limit *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecLimit(LimitState *node);
+extern LimitState *ExecInitLimit(Limit *node, EState *estate, int eflags,
+								 PlanState *parent);
+extern bool ExecPushTupleToLimit(TupleTableSlot *slot, LimitState *node);
+extern void ExecPushNullToLimit(TupleTableSlot *slot, LimitState *node);
 extern void ExecEndLimit(LimitState *node);
 extern void ExecReScanLimit(LimitState *node);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index abbe67ba0c..056db943b0 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2208,13 +2208,9 @@ typedef struct LockRowsState
  */
 typedef enum
 {
-	LIMIT_INITIAL,				/* initial state for LIMIT node */
-	LIMIT_RESCAN,				/* rescan after recomputing parameters */
-	LIMIT_EMPTY,				/* there are no returnable rows */
-	LIMIT_INWINDOW,				/* have returned a row in the window */
-	LIMIT_SUBPLANEOF,			/* at EOF of subplan (within window) */
-	LIMIT_WINDOWEND,			/* stepped off end of window */
-	LIMIT_WINDOWSTART			/* stepped off beginning of window */
+	LIMIT_INITIAL,		/* initial state for LIMIT node */
+	LIMIT_ACTIVE,		/* waiting for tuples */
+	LIMIT_DONE,			/* pushed all needed tuples */
 } LimitStateCond;
 
 typedef struct LimitState
-- 
2.11.0

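The reversed Limit above boils down to a counter check on every pushed tuple: skip the first `offset` tuples, forward the next `count`, and report back-pressure once the window closes. A hypothetical standalone model of that control flow follows; the `forwarded` counter stands in for the real push to the parent, and the type and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct PushLimit
{
	long	offset;			/* tuples to skip */
	long	count;			/* tuples to let through */
	long	position;		/* tuples seen so far */
	long	forwarded;		/* stand-in for tuples pushed to the parent */
} PushLimit;

/*
 * Called once per tuple pushed from below; returns true while more tuples
 * are wanted, matching the patch's "parent accepts tuples" convention.
 */
bool
limit_push(PushLimit *st)
{
	if (++st->position <= st->offset)
		return true;		/* still before the window; drop the tuple */

	st->forwarded++;		/* here the real node calls ExecPushTuple */

	/* stop the child once the last tuple of the window has gone out */
	return st->position - st->offset < st->count;
}
```

This mirrors the `++node->position <= node->offset` and `position - offset >= count` tests in ExecPushTupleToLimit, though the sketch omits the LIMIT_INITIAL parameter computation, the LIMIT_DONE guard, and the parent's own acceptance result.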
0007-Reversed-hashed-Agg-implementation.patch
From 9822841ce1cbaa214869568b9dd3edbd35efb0d7 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Tue, 14 Mar 2017 15:26:55 +0300
Subject: [PATCH 7/8] Reversed hashed Agg implementation.

Only AGG_PLAIN and AGG_HASHED are reversed. The part that puts tuples into the
hashtable is practically the same as before, with the hashtable lookups inlined.

To iterate over the hashtable efficiently, a 'foreach' method was added to
simplehash.h. As in SeqScan or HashJoin, the goal is to have a single loop that
iterates over the tuples and sends them. The current implementation allows only
one 'foreach' type per hashtable; this should obviously be changed if needed.
---
 src/backend/executor/execGrouping.c |  75 ----
 src/backend/executor/execProcnode.c |  17 +
 src/backend/executor/nodeAgg.c      | 793 ++++++++++++++++++++++++++----------
 src/include/executor/executor.h     |  98 ++++-
 src/include/executor/nodeAgg.h      |   6 +-
 src/include/lib/simplehash.h        |  60 +++
 6 files changed, 760 insertions(+), 289 deletions(-)

diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
index 4b1f634e21..7d5ae4aa04 100644
--- a/src/backend/executor/execGrouping.c
+++ b/src/backend/executor/execGrouping.c
@@ -51,81 +51,6 @@ static int	TupleHashTableMatch(struct tuplehash_hash *tb, const MinimalTuple tup
  *****************************************************************************/
 
 /*
- * execTuplesMatch
- *		Return true if two tuples match in all the indicated fields.
- *
- * This actually implements SQL's notion of "not distinct".  Two nulls
- * match, a null and a not-null don't match.
- *
- * slot1, slot2: the tuples to compare (must have same columns!)
- * numCols: the number of attributes to be examined
- * matchColIdx: array of attribute column numbers
- * eqFunctions: array of fmgr lookup info for the equality functions to use
- * evalContext: short-term memory context for executing the functions
- *
- * NB: evalContext is reset each time!
- */
-bool
-execTuplesMatch(TupleTableSlot *slot1,
-				TupleTableSlot *slot2,
-				int numCols,
-				AttrNumber *matchColIdx,
-				FmgrInfo *eqfunctions,
-				MemoryContext evalContext)
-{
-	MemoryContext oldContext;
-	bool		result;
-	int			i;
-
-	/* Reset and switch into the temp context. */
-	MemoryContextReset(evalContext);
-	oldContext = MemoryContextSwitchTo(evalContext);
-
-	/*
-	 * We cannot report a match without checking all the fields, but we can
-	 * report a non-match as soon as we find unequal fields.  So, start
-	 * comparing at the last field (least significant sort key). That's the
-	 * most likely to be different if we are dealing with sorted input.
-	 */
-	result = true;
-
-	for (i = numCols; --i >= 0;)
-	{
-		AttrNumber	att = matchColIdx[i];
-		Datum		attr1,
-					attr2;
-		bool		isNull1,
-					isNull2;
-
-		attr1 = slot_getattr(slot1, att, &isNull1);
-
-		attr2 = slot_getattr(slot2, att, &isNull2);
-
-		if (isNull1 != isNull2)
-		{
-			result = false;		/* one null and one not; they aren't equal */
-			break;
-		}
-
-		if (isNull1)
-			continue;			/* both are null, treat as equal */
-
-		/* Apply the type-specific equality function */
-
-		if (!DatumGetBool(FunctionCall2(&eqfunctions[i],
-										attr1, attr2)))
-		{
-			result = false;		/* they aren't equal */
-			break;
-		}
-	}
-
-	MemoryContextSwitchTo(oldContext);
-
-	return result;
-}
-
-/*
  * execTuplesUnequal
  *		Return true if two tuples are definitely unequal in the indicated
  *		fields.
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 1ebb0da36f..ab0312a4bf 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -193,6 +193,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 		/*
 		 * materialization nodes
 		 */
+		case T_Agg:
+			result = (PlanState *) ExecInitAgg((Agg *) node,
+											   estate, eflags, parent);
+			break;
+
 		case T_Hash:
 			result = (PlanState *) ExecInitHash((Hash *) node,
 												estate, eflags, parent);
@@ -282,6 +287,10 @@ ExecPushTuple(TupleTableSlot *slot, PlanState *pusher)
 
 	if (nodeTag(receiver) == T_LimitState)
 		return ExecPushTupleToLimit(slot, (LimitState *) receiver);
+
+	else if (nodeTag(receiver) == T_AggState)
+		return ExecPushTupleToAgg(slot, (AggState *) receiver);
+
 	else if (nodeTag(receiver) == T_HashState)
 		return ExecPushTupleToHash(slot, (HashState *) receiver);
 
@@ -324,6 +333,10 @@ ExecPushNull(TupleTableSlot *slot, PlanState *pusher)
 
 	if (nodeTag(receiver) == T_LimitState)
 		return ExecPushNullToLimit(slot, (LimitState *) receiver);
+
+	else if (nodeTag(receiver) == T_AggState)
+		return ExecPushNullToAgg(slot, (AggState *) receiver);
+
 	else if (nodeTag(receiver) == T_HashState)
 		return ExecPushNullToHash(slot, (HashState *) receiver);
 
@@ -397,6 +410,10 @@ ExecEndNode(PlanState *node)
 		/*
 		 * materialization nodes
 		 */
+		case T_AggState:
+			ExecEndAgg((AggState *) node);
+			break;
+
 		case T_HashState:
 			ExecEndHash((HashState *) node);
 			break;
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index fa19358d19..8de9c0af53 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -153,6 +153,8 @@
 #include "postgres.h"
 
 #include "access/htup_details.h"
+#include "access/parallel.h"
+#include "access/hash.h"
 #include "catalog/objectaccess.h"
 #include "catalog/pg_aggregate.h"
 #include "catalog/pg_proc.h"
@@ -440,7 +442,6 @@ typedef struct AggStatePerPhaseData
 	Sort	   *sortnode;		/* Sort node for input ordering for phase */
 }	AggStatePerPhaseData;
 
-
 static void initialize_phase(AggState *aggstate, int newphase);
 static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
 static void initialize_aggregates(AggState *aggstate,
@@ -460,10 +461,10 @@ static void process_ordered_aggregate_single(AggState *aggstate,
 static void process_ordered_aggregate_multi(AggState *aggstate,
 								AggStatePerTrans pertrans,
 								AggStatePerGroup pergroupstate);
-static void finalize_aggregate(AggState *aggstate,
-				   AggStatePerAgg peragg,
-				   AggStatePerGroup pergroupstate,
-				   Datum *resultVal, bool *resultIsNull);
+static inline void finalize_aggregate(AggState *aggstate,
+									  AggStatePerAgg peragg,
+									  AggStatePerGroup pergroupstate,
+									  Datum *resultVal, bool *resultIsNull);
 static void finalize_partialaggregate(AggState *aggstate,
 						  AggStatePerAgg peragg,
 						  AggStatePerGroup pergroupstate,
@@ -471,19 +472,22 @@ static void finalize_partialaggregate(AggState *aggstate,
 static void prepare_projection_slot(AggState *aggstate,
 						TupleTableSlot *slot,
 						int currentSet);
-static void finalize_aggregates(AggState *aggstate,
-					AggStatePerAgg peragg,
-					AggStatePerGroup pergroup,
-					int currentSet);
+static inline void finalize_aggregates(AggState *aggstate,
+									   AggStatePerAgg peraggs,
+									   AggStatePerGroup pergroup,
+									   int currentSet);
 static TupleTableSlot *project_aggregates(AggState *aggstate);
+static inline bool project_aggregates_and_push(AggState *aggstate);
+static inline bool AggPushHashEntry(TupleHashEntry entry, void *astate);
 static Bitmapset *find_unaggregated_cols(AggState *aggstate);
 static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
 static void build_hash_table(AggState *aggstate);
-static TupleHashEntryData *lookup_hash_entry(AggState *aggstate,
-				  TupleTableSlot *inputslot);
+static inline TupleHashEntryData *lookup_hash_entry(AggState *aggstate,
+													TupleTableSlot *inputslot);
 static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
-static void agg_fill_hash_table(AggState *aggstate);
-static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
+static void agg_puttup_plain(AggState *aggstate, TupleTableSlot *outerslot);
+static bool agg_finalize_plain(AggState *aggstate);
+static void agg_puttup_hash_table(AggState *aggstate, TupleTableSlot *outerslot);
 static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
 static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
 						  AggState *aggstate, EState *estate,
@@ -498,6 +502,39 @@ static int find_compatible_pertrans(AggState *aggstate, Aggref *newagg,
 						 Oid aggserialfn, Oid aggdeserialfn,
 						 Datum initValue, bool initValueIsNull,
 						 List *transnos);
+/*
+ * We use our own hash table instead of the one defined in execGrouping.c;
+ * see the notes below.
+ */
+/* define parameters necessary to generate the tuple hash table interface */
+#define SH_PREFIX aggtuplehash
+#define SH_ELEMENT_TYPE TupleHashEntryData
+#define SH_KEY_TYPE MinimalTuple
+#define SH_SCOPE static inline
+#define SH_FOREACH_ON
+#define SH_FOREACH_ACC_TYPE bool
+#define SH_DECLARE
+#include "lib/simplehash.h"
+static inline bool inline_and(bool old, bool new);
+
+/*
+ * And our own copies of funcs from execGrouping.c
+ */
+static TupleHashTable BuildAggTupleHashTable(int numCols, AttrNumber *keyColIdx,
+											 FmgrInfo *eqfunctions,
+											 FmgrInfo *hashfunctions,
+											 long nbuckets, Size additionalsize,
+											 MemoryContext tablecxt,
+											 MemoryContext tempcxt,
+											 bool use_variable_hash_iv);
+static inline TupleHashEntry LookupAggTupleHashEntry(TupleHashTable hashtable,
+													 TupleTableSlot *slot,
+													 bool *isnew);
+static inline uint32 AggTupleHashTableHash(struct aggtuplehash_hash *tb,
+										   const MinimalTuple tuple);
+static inline int AggTupleHashTableMatch(struct aggtuplehash_hash *tb,
+										 const MinimalTuple tuple1,
+										 const MinimalTuple tuple2);
 
 
 /*
@@ -1573,7 +1610,6 @@ finalize_aggregates(AggState *aggstate,
 	Datum	   *aggvalues = econtext->ecxt_aggvalues;
 	bool	   *aggnulls = econtext->ecxt_aggnulls;
 	int			aggno;
-	int			transno;
 
 	Assert(currentSet == 0 ||
 		   ((Agg *) aggstate->ss.ps.plan)->aggstrategy != AGG_HASHED);
@@ -1581,32 +1617,6 @@ finalize_aggregates(AggState *aggstate,
 	aggstate->current_set = currentSet;
 
 	/*
-	 * If there were any DISTINCT and/or ORDER BY aggregates, sort their
-	 * inputs and run the transition functions.
-	 */
-	for (transno = 0; transno < aggstate->numtrans; transno++)
-	{
-		AggStatePerTrans pertrans = &aggstate->pertrans[transno];
-		AggStatePerGroup pergroupstate;
-
-		pergroupstate = &pergroup[transno + (currentSet * (aggstate->numtrans))];
-
-		if (pertrans->numSortCols > 0)
-		{
-			Assert(((Agg *) aggstate->ss.ps.plan)->aggstrategy != AGG_HASHED);
-
-			if (pertrans->numInputs == 1)
-				process_ordered_aggregate_single(aggstate,
-												 pertrans,
-												 pergroupstate);
-			else
-				process_ordered_aggregate_multi(aggstate,
-												pertrans,
-												pergroupstate);
-		}
-	}
-
-	/*
 	 * Run the final functions.
 	 */
 	for (aggno = 0; aggno < aggstate->numaggs; aggno++)
@@ -1617,12 +1627,8 @@ finalize_aggregates(AggState *aggstate,
 
 		pergroupstate = &pergroup[transno + (currentSet * (aggstate->numtrans))];
 
-		if (DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit))
-			finalize_partialaggregate(aggstate, peragg, pergroupstate,
-									  &aggvalues[aggno], &aggnulls[aggno]);
-		else
-			finalize_aggregate(aggstate, peragg, pergroupstate,
-							   &aggvalues[aggno], &aggnulls[aggno]);
+		finalize_aggregate(aggstate, peragg, pergroupstate,
+						   &aggvalues[aggno], &aggnulls[aggno]);
 	}
 }
 
@@ -1654,6 +1660,105 @@ project_aggregates(AggState *aggstate)
 }
 
 /*
+ * Project the result of a group (whose aggs have already been calculated by
+ * finalize_aggregates) and push the resulting tuple. Returns true if the
+ * tuple was pushed (or filtered out by the qual), false if the parent
+ * doesn't want to accept tuples anymore.
+ */
+static inline bool
+project_aggregates_and_push(AggState *aggstate)
+{
+	ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
+
+	/*
+	 * Check the qual (HAVING clause); if the group does not match, ignore it.
+	 */
+	if (ExecQual(aggstate->ss.ps.qual, econtext, false))
+	{
+		/*
+		 * Form and return or store a projection tuple using the aggregate
+		 * results and the representative input tuple.
+		 */
+		TupleTableSlot *slot;
+
+		slot = ExecProject(aggstate->ss.ps.ps_ProjInfo);
+		return ExecPushTuple(slot, (PlanState *) aggstate);
+
+	}
+	else
+		InstrCountFiltered1(aggstate, 1);
+
+	return true;
+}
+
+/*
+ * Finalize one TupleHashEntry, project the result and push it. Returns true
+ * if the tuple was pushed, false if the parent doesn't want to accept
+ * tuples anymore.
+ */
+static inline bool
+AggPushHashEntry(TupleHashEntry entry, void *astate)
+{
+	AggState *aggstate = (AggState *) astate;
+	ExprContext *econtext;
+	AggStatePerAgg peragg;
+	AggStatePerGroup pergroup;
+	TupleTableSlot *firstSlot;
+	TupleTableSlot *hashslot;
+	int i;
+
+	/*
+	 * get state info from node
+	 */
+	/* econtext is the per-output-tuple expression context */
+	econtext = aggstate->ss.ps.ps_ExprContext;
+	peragg = aggstate->peragg;
+	firstSlot = aggstate->ss.ss_ScanTupleSlot;
+	hashslot = aggstate->hashslot;
+
+	/*
+	 * Clear the per-output-tuple context for each group
+	 *
+	 * We intentionally don't use ReScanExprContext here; if any aggs have
+	 * registered shutdown callbacks, they mustn't be called yet, since we
+	 * might not be done with that agg.
+	 */
+	ResetExprContext(econtext);
+
+	/*
+	 * Transform the representative tuple back into one with the right
+	 * columns, so that it can be used in ExecProject.
+	 */
+	ExecStoreMinimalTuple(entry->firstTuple, hashslot, false);
+	slot_getallattrs(hashslot);
+
+	ExecClearTuple(firstSlot);
+	memset(firstSlot->tts_isnull, true,
+		   firstSlot->tts_tupleDescriptor->natts * sizeof(bool));
+
+	for (i = 0; i < aggstate->numhashGrpCols; i++)
+	{
+		int			varNumber = aggstate->hashGrpColIdxInput[i] - 1;
+
+		firstSlot->tts_values[varNumber] = hashslot->tts_values[i];
+		firstSlot->tts_isnull[varNumber] = hashslot->tts_isnull[i];
+	}
+	ExecStoreVirtualTuple(firstSlot);
+
+	pergroup = (AggStatePerGroup) entry->additional;
+
+	finalize_aggregates(aggstate, peragg, pergroup, 0);
+
+	/*
+	 * Use the representative input tuple for any references to
+	 * non-aggregated input columns in the qual and tlist.
+	 */
+	econtext->ecxt_outertuple = firstSlot;
+
+	return project_aggregates_and_push(aggstate);
+}
+
+
+/*
  * find_unaggregated_cols
  *	  Construct a bitmapset of the column numbers of un-aggregated Vars
  *	  appearing in our targetlist and qual (HAVING clause)
@@ -1719,12 +1824,12 @@ build_hash_table(AggState *aggstate)
 
 	additionalsize = aggstate->numaggs * sizeof(AggStatePerGroupData);
 
-	aggstate->hashtable = BuildTupleHashTable(node->numCols,
-											  aggstate->hashGrpColIdxHash,
-											  aggstate->phase->eqfunctions,
-											  aggstate->hashfunctions,
-											  node->numGroups,
-											  additionalsize,
+	aggstate->hashtable = BuildAggTupleHashTable(node->numCols,
+												 aggstate->hashGrpColIdxHash,
+												 aggstate->phase->eqfunctions,
+												 aggstate->hashfunctions,
+												 node->numGroups,
+												 additionalsize,
 							 aggstate->aggcontexts[0]->ecxt_per_tuple_memory,
 											  tmpmem,
 								  DO_AGGSPLIT_SKIPFINAL(aggstate->aggsplit));
@@ -1845,7 +1950,7 @@ hash_agg_entry_size(int numAggs)
  *
  * When called, CurrentMemoryContext should be the per-query context.
  */
-static TupleHashEntryData *
+static inline TupleHashEntryData *
 lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
 {
 	TupleTableSlot *hashslot = aggstate->hashslot;
@@ -1867,7 +1972,7 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
 	ExecStoreVirtualTuple(hashslot);
 
 	/* find or create the hashtable entry using the filtered tuple */
-	entry = LookupTupleHashEntry(aggstate->hashtable, hashslot, &isnew);
+	entry = LookupAggTupleHashEntry(aggstate->hashtable, hashslot, &isnew);
 
 	if (isnew)
 	{
@@ -1883,43 +1988,121 @@ lookup_hash_entry(AggState *aggstate, TupleTableSlot *inputslot)
 }
 
 /*
- * ExecAgg -
- *
- *	  ExecAgg receives tuples from its outer subplan and aggregates over
- *	  the appropriate attribute for each aggregate function use (Aggref
- *	  node) appearing in the targetlist or qual of the node.  The number
- *	  of tuples to aggregate over depends on whether grouped or plain
- *	  aggregation is selected.  In grouped aggregation, we produce a result
- *	  row for each group; in plain aggregation there's a single result row
- *	  for the whole query.  In either case, the value of each aggregate is
- *	  stored in the expression context to be used when ExecProject evaluates
- *	  the result tuple.
+ * ExecPushTupleToAgg -
+ *
+ *	  ExecPushTupleToAgg receives a tuple from its outer subplan and aggregates it
+ *	  over the appropriate attribute for each aggregate function use (Aggref
+ *	  node) appearing in the targetlist or qual of the node.  The number of
+ *	  tuples to aggregate over depends on whether grouped or plain aggregation
+ *	  is selected.  In grouped aggregation, we produce a result row for each
+ *	  group; in plain aggregation there's a single result row for the whole
+ *	  query.  In either case, the value of each aggregate is stored in the
+ *	  expression context to be used when ExecProject evaluates the result
+ *	  tuple.
  */
-TupleTableSlot *
-ExecAgg(AggState *node)
+bool
+ExecPushTupleToAgg(TupleTableSlot *slot, AggState *node)
 {
-	TupleTableSlot *result;
+	AggStrategy strategy = node->phase->aggnode->aggstrategy;
+	/* Only AGG_HASHED and AGG_PLAIN are supported at the moment */
+	Assert(strategy == AGG_HASHED || strategy == AGG_PLAIN);
+	/* AGGSPLIT is not supported at the moment */
+	Assert(node->aggsplit == AGGSPLIT_SIMPLE);
+	/* neither AGG_HASHED nor AGG_PLAIN support multiple grouping sets */
+	Assert(node->phase->numsets == 0);
+	Assert(!node->agg_done);
+
+	if (strategy == AGG_PLAIN)
+		agg_puttup_plain(node, slot);
+	else
+		agg_puttup_hash_table(node, slot);
+
+	return true;
+}
+
+/* A NULL tuple arrived: finalize the aggregation and push the result tuples */
+void
+ExecPushNullToAgg(TupleTableSlot *slot, AggState *node)
+{
+	AggStrategy strategy = node->phase->aggnode->aggstrategy;
+	bool parent_accepts_tuples;
 
-	if (!node->agg_done)
+	if (strategy == AGG_PLAIN)
 	{
-		/* Dispatch based on strategy */
-		switch (node->phase->aggnode->aggstrategy)
-		{
-			case AGG_HASHED:
-				if (!node->table_filled)
-					agg_fill_hash_table(node);
-				result = agg_retrieve_hash_table(node);
-				break;
-			default:
-				result = agg_retrieve_direct(node);
-				break;
-		}
+		parent_accepts_tuples = agg_finalize_plain(node);
+	}
+	else
+	{
+		node->table_filled = true;
+		/* For each tuple in hashtable, push it */
+		parent_accepts_tuples = aggtuplehash_foreach(
+			(aggtuplehash_hash *) node->hashtable->hashtab, node);
+	}
 
-		if (!TupIsNull(result))
-			return result;
+	if (parent_accepts_tuples)
+		/* If the parent still expects tuples, let it know we are done */
+		ExecPushNull(NULL, (PlanState *) node);
+
+	node->agg_done = true;
+}
+
+/* advance aggregates for arrived tuple in AGG_PLAIN */
+static void
+agg_puttup_plain(AggState *aggstate, TupleTableSlot *slot)
+{
+	AggStatePerGroup pergroup = aggstate->pergroup;
+	ExprContext *tmpcontext = aggstate->tmpcontext;
+	TupleTableSlot *firstSlot = aggstate->ss.ss_ScanTupleSlot;
+
+	if (TupIsNull(firstSlot))
+	{
+		/*
+		 * First call of agg_puttup_plain
+		 */
+		int numReset = 1;	/* always one grouping set in AGG_PLAIN */
+
+		/* always first and only grouping set in AGG_PLAIN */
+		aggstate->projected_set = 0;
+
+		/*
+		 * Initialize working state for a new input tuple group.
+		 */
+		initialize_aggregates(aggstate, pergroup, numReset);
+
+		/*
+		 * Store the copied first input tuple in the tuple table slot
+		 * reserved for it.	 The tuple will be deleted when it is
+		 * cleared from the slot.
+		 */
+		ExecStoreTuple(ExecCopySlotTuple(slot),
+					   firstSlot,
+					   InvalidBuffer,
+					   true);
 	}
 
-	return NULL;
+	/* set up for advance_aggregates call */
+	tmpcontext->ecxt_outertuple = slot;
+
+	advance_aggregates(aggstate, pergroup);
+}
+
+/* finalize and push AGG_PLAIN */
+static bool
+agg_finalize_plain(AggState *aggstate)
+{
+	AggStatePerGroup pergroup = aggstate->pergroup;
+	AggStatePerAgg peragg = aggstate->peragg;
+	TupleTableSlot *firstSlot = aggstate->ss.ss_ScanTupleSlot;
+	int			currentSet;
+
+	if (TupIsNull(firstSlot))
+		/* agg_puttup_plain was never called */
+		return true;
+
+	currentSet = aggstate->projected_set;
+	prepare_projection_slot(aggstate, firstSlot, currentSet);
+	finalize_aggregates(aggstate, peragg, pergroup, currentSet);
+	return project_aggregates_and_push(aggstate);
 }
 
 /*
@@ -2247,141 +2430,31 @@ agg_retrieve_direct(AggState *aggstate)
 }
 
 /*
- * ExecAgg for hashed case: phase 1, read input and build hash table
+ * ExecPushTupleToAgg for the hashed case: add one tuple to the hash table
  */
 static void
-agg_fill_hash_table(AggState *aggstate)
+agg_puttup_hash_table(AggState *aggstate, TupleTableSlot *outerslot)
 {
 	ExprContext *tmpcontext;
 	TupleHashEntryData *entry;
-	TupleTableSlot *outerslot;
 
 	/*
-	 * get state info from node
-	 *
 	 * tmpcontext is the per-input-tuple expression context
 	 */
 	tmpcontext = aggstate->tmpcontext;
 
-	/*
-	 * Process each outer-plan tuple, and then fetch the next one, until we
-	 * exhaust the outer plan.
-	 */
-	for (;;)
-	{
-		outerslot = fetch_input_tuple(aggstate);
-		if (TupIsNull(outerslot))
-			break;
-		/* set up for advance_aggregates call */
-		tmpcontext->ecxt_outertuple = outerslot;
-
-		/* Find or build hashtable entry for this tuple's group */
-		entry = lookup_hash_entry(aggstate, outerslot);
-
-		/* Advance the aggregates */
-		if (DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
-			combine_aggregates(aggstate, (AggStatePerGroup) entry->additional);
-		else
-			advance_aggregates(aggstate, (AggStatePerGroup) entry->additional);
-
-		/* Reset per-input-tuple context after each tuple */
-		ResetExprContext(tmpcontext);
-	}
-
-	aggstate->table_filled = true;
-	/* Initialize to walk the hash table */
-	ResetTupleHashIterator(aggstate->hashtable, &aggstate->hashiter);
-}
-
-/*
- * ExecAgg for hashed case: phase 2, retrieving groups from hash table
- */
-static TupleTableSlot *
-agg_retrieve_hash_table(AggState *aggstate)
-{
-	ExprContext *econtext;
-	AggStatePerAgg peragg;
-	AggStatePerGroup pergroup;
-	TupleHashEntryData *entry;
-	TupleTableSlot *firstSlot;
-	TupleTableSlot *result;
-	TupleTableSlot *hashslot;
-
-	/*
-	 * get state info from node
-	 */
-	/* econtext is the per-output-tuple expression context */
-	econtext = aggstate->ss.ps.ps_ExprContext;
-	peragg = aggstate->peragg;
-	firstSlot = aggstate->ss.ss_ScanTupleSlot;
-	hashslot = aggstate->hashslot;
-
 
-	/*
-	 * We loop retrieving groups until we find one satisfying
-	 * aggstate->ss.ps.qual
-	 */
-	while (!aggstate->agg_done)
-	{
-		int i;
+	/* set up for advance_aggregates call */
+	tmpcontext->ecxt_outertuple = outerslot;
 
-		/*
-		 * Find the next entry in the hash table
-		 */
-		entry = ScanTupleHashTable(aggstate->hashtable, &aggstate->hashiter);
-		if (entry == NULL)
-		{
-			/* No more entries in hashtable, so done */
-			aggstate->agg_done = TRUE;
-			return NULL;
-		}
-
-		/*
-		 * Clear the per-output-tuple context for each group
-		 *
-		 * We intentionally don't use ReScanExprContext here; if any aggs have
-		 * registered shutdown callbacks, they mustn't be called yet, since we
-		 * might not be done with that agg.
-		 */
-		ResetExprContext(econtext);
-
-		/*
-		 * Transform representative tuple back into one with the right
-		 * columns.
-		 */
-		ExecStoreMinimalTuple(entry->firstTuple, hashslot, false);
-		slot_getallattrs(hashslot);
-
-		ExecClearTuple(firstSlot);
-		memset(firstSlot->tts_isnull, true,
-			   firstSlot->tts_tupleDescriptor->natts * sizeof(bool));
-
-		for (i = 0; i < aggstate->numhashGrpCols; i++)
-		{
-			int			varNumber = aggstate->hashGrpColIdxInput[i] - 1;
-
-			firstSlot->tts_values[varNumber] = hashslot->tts_values[i];
-			firstSlot->tts_isnull[varNumber] = hashslot->tts_isnull[i];
-		}
-		ExecStoreVirtualTuple(firstSlot);
-
-		pergroup = (AggStatePerGroup) entry->additional;
-
-		finalize_aggregates(aggstate, peragg, pergroup, 0);
-
-		/*
-		 * Use the representative input tuple for any references to
-		 * non-aggregated input columns in the qual and tlist.
-		 */
-		econtext->ecxt_outertuple = firstSlot;
+	/* Find or build hashtable entry for this tuple's group */
+	entry = lookup_hash_entry(aggstate, outerslot);
 
-		result = project_aggregates(aggstate);
-		if (result)
-			return result;
-	}
+	/* Advance the aggregates */
+	advance_aggregates(aggstate, (AggStatePerGroup) entry->additional);
 
-	/* No more groups */
-	return NULL;
+	/* Reset per-input-tuple context after each tuple */
+	ResetExprContext(tmpcontext);
 }
 
 /* -----------------
@@ -2392,7 +2465,7 @@ agg_retrieve_hash_table(AggState *aggstate)
  * -----------------
  */
 AggState *
-ExecInitAgg(Agg *node, EState *estate, int eflags)
+ExecInitAgg(Agg *node, EState *estate, int eflags, PlanState *parent)
 {
 	AggState   *aggstate;
 	AggStatePerAgg peraggs;
@@ -2421,6 +2494,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	aggstate = makeNode(AggState);
 	aggstate->ss.ps.plan = (Plan *) node;
 	aggstate->ss.ps.state = estate;
+	aggstate->ss.ps.parent = parent;
 
 	aggstate->aggs = NIL;
 	aggstate->numaggs = 0;
@@ -2523,7 +2597,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 	if (node->aggstrategy == AGG_HASHED)
 		eflags &= ~EXEC_FLAG_REWIND;
 	outerPlan = outerPlan(node);
-	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags, NULL);
+	outerPlanState(aggstate) = ExecInitNode(outerPlan, estate, eflags,
+											(PlanState *) aggstate);
 
 	/*
 	 * initialize source tuple type.
@@ -3780,3 +3855,309 @@ aggregate_dummy(PG_FUNCTION_ARGS)
 		 fcinfo->flinfo->fn_oid);
 	return (Datum) 0;			/* keep compiler quiet */
 }
+
+/*
+ * We want to use our own hashtable instead of the one defined in
+ * execGrouping.c, because
+ * - we want to inline its interface functions
+ * - we want a 'foreach' method with an inlined action
+ *
+ * While we need a new hashtable, the stored type (TupleHashEntry) is exactly
+ * the same. Because of that, the types (tuplehash_hash *) and
+ * (aggtuplehash_hash *) are fully compatible. So, instead of changing the
+ * type of aggstate->hashtable to a copy-pasted TupleHashTableData whose only
+ * difference is the hashtab field being of type (aggtuplehash_hash *), we
+ * use casts where needed.
+ *
+ * Since the functions in execGrouping.c are hard-wired to the `tuplehash`
+ * hashtable defined there, we can't use them and need our own versions too,
+ * so they are basically copy-pasted with the hashtable name changed.  Of
+ * course, this is not ideal, but again, our goal for now is to estimate the
+ * performance benefits.  Later, if needed, execGrouping.c may be generalized
+ * to handle any hashtable.
+ */
+
+/*
+ * Define parameters for tuple hash table code generation.
+ */
+#define SH_PREFIX aggtuplehash
+#define SH_ELEMENT_TYPE TupleHashEntryData
+#define SH_KEY_TYPE MinimalTuple
+#define SH_KEY firstTuple
+#define SH_HASH_KEY(tb, key) AggTupleHashTableHash(tb, key)
+#define SH_EQUAL(tb, a, b) AggTupleHashTableMatch(tb, a, b) == 0
+#define SH_SCOPE static inline
+#define SH_STORE_HASH
+#define SH_GET_HASH(tb, a) a->hash
+#define SH_FOREACH_ON
+#define SH_FOREACH_ACC_TYPE bool
+#define SH_FOREACH_ACC_INIT true
+#define SH_FOREACH_FUNC AggPushHashEntry
+#define SH_FOREACH_ACC_FUNC inline_and
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+static inline bool
+inline_and(bool old, bool new)
+{
+	return old && new;
+}
+
+/*
+ * Functions copied from execGrouping.c
+ */
+
+/*
+ * Construct an empty TupleHashTable
+ *
+ *	numCols, keyColIdx: identify the tuple fields to use as lookup key
+ *	eqfunctions: equality comparison functions to use
+ *	hashfunctions: datatype-specific hashing functions to use
+ *	nbuckets: initial estimate of hashtable size
+ *	additionalsize: size of data stored in ->additional
+ *	tablecxt: memory context in which to store table and table entries
+ *	tempcxt: short-lived context for evaluating hash and comparison functions
+ *
+ * The function arrays may be made with execTuplesHashPrepare().  Note they
+ * are not cross-type functions, but expect to see the table datatype(s)
+ * on both sides.
+ *
+ * Note that keyColIdx, eqfunctions, and hashfunctions must be allocated in
+ * storage that will live as long as the hashtable does.
+ */
+static TupleHashTable
+BuildAggTupleHashTable(int numCols, AttrNumber *keyColIdx,
+					FmgrInfo *eqfunctions,
+					FmgrInfo *hashfunctions,
+					long nbuckets, Size additionalsize,
+					MemoryContext tablecxt, MemoryContext tempcxt,
+					bool use_variable_hash_iv)
+{
+	TupleHashTable hashtable;
+	Size		entrysize = sizeof(TupleHashEntryData) + additionalsize;
+
+	Assert(nbuckets > 0);
+
+	/* Limit initial table size request to not more than work_mem */
+	nbuckets = Min(nbuckets, (long) ((work_mem * 1024L) / entrysize));
+
+	hashtable = (TupleHashTable)
+		MemoryContextAlloc(tablecxt, sizeof(TupleHashTableData));
+
+	hashtable->numCols = numCols;
+	hashtable->keyColIdx = keyColIdx;
+	hashtable->tab_hash_funcs = hashfunctions;
+	hashtable->tab_eq_funcs = eqfunctions;
+	hashtable->tablecxt = tablecxt;
+	hashtable->tempcxt = tempcxt;
+	hashtable->entrysize = entrysize;
+	hashtable->tableslot = NULL;	/* will be made on first lookup */
+	hashtable->inputslot = NULL;
+	hashtable->in_hash_funcs = NULL;
+	hashtable->cur_eq_funcs = NULL;
+
+	/*
+	 * If parallelism is in use, even if the master backend is performing the
+	 * scan itself, we don't want to create the hashtable exactly the same way
+	 * in all workers. As hashtables are iterated over in keyspace-order,
+	 * doing so in all processes in the same way is likely to lead to
+	 * "unbalanced" hashtables when the table size initially is
+	 * underestimated.
+	 */
+	if (use_variable_hash_iv)
+		hashtable->hash_iv = hash_uint32(ParallelWorkerNumber);
+	else
+		hashtable->hash_iv = 0;
+
+	hashtable->hashtab = (tuplehash_hash*) aggtuplehash_create(tablecxt,
+															   nbuckets,
+															   hashtable);
+
+	return hashtable;
+}
+
+/*
+ * Find or create a hashtable entry for the tuple group containing the
+ * given tuple.  The tuple must be the same type as the hashtable entries.
+ *
+ * If isnew is NULL, we do not create new entries; we return NULL if no
+ * match is found.
+ *
+ * If isnew isn't NULL, then a new entry is created if no existing entry
+ * matches.  On return, *isnew is true if the entry is newly created,
+ * false if it existed already.  ->additional_data in the new entry has
+ * been zeroed.
+ */
+static inline TupleHashEntry
+LookupAggTupleHashEntry(TupleHashTable hashtable, TupleTableSlot *slot,
+					 bool *isnew)
+{
+	TupleHashEntryData *entry;
+	MemoryContext oldContext;
+	bool		found;
+	MinimalTuple key;
+
+	/* If first time through, clone the input slot to make table slot */
+	if (hashtable->tableslot == NULL)
+	{
+		TupleDesc	tupdesc;
+
+		oldContext = MemoryContextSwitchTo(hashtable->tablecxt);
+
+		/*
+		 * We copy the input tuple descriptor just for safety --- we assume
+		 * all input tuples will have equivalent descriptors.
+		 */
+		tupdesc = CreateTupleDescCopy(slot->tts_tupleDescriptor);
+		hashtable->tableslot = MakeSingleTupleTableSlot(tupdesc);
+		MemoryContextSwitchTo(oldContext);
+	}
+
+	/* Need to run the hash functions in short-lived context */
+	oldContext = MemoryContextSwitchTo(hashtable->tempcxt);
+
+	/* set up data needed by hash and match functions */
+	hashtable->inputslot = slot;
+	hashtable->in_hash_funcs = hashtable->tab_hash_funcs;
+	hashtable->cur_eq_funcs = hashtable->tab_eq_funcs;
+
+	key = NULL; /* flag to reference inputslot */
+
+	if (isnew)
+	{
+		entry = aggtuplehash_insert((aggtuplehash_hash *) hashtable->hashtab,
+									key, &found);
+
+		if (found)
+		{
+			/* found pre-existing entry */
+			*isnew = false;
+		}
+		else
+		{
+			/* created new entry */
+			*isnew = true;
+			/* zero caller data */
+			entry->additional = NULL;
+			MemoryContextSwitchTo(hashtable->tablecxt);
+			/* Copy the first tuple into the table context */
+			entry->firstTuple = ExecCopySlotMinimalTuple(slot);
+		}
+	}
+	else
+	{
+		entry = aggtuplehash_lookup((aggtuplehash_hash *) hashtable->hashtab,
+									key);
+	}
+
+	MemoryContextSwitchTo(oldContext);
+
+	return entry;
+}
+
+/*
+ * Compute the hash value for a tuple
+ *
+ * The passed-in key is a pointer to TupleHashEntryData.  In an actual hash
+ * table entry, the firstTuple field points to a tuple (in MinimalTuple
+ * format).  LookupTupleHashEntry sets up a dummy TupleHashEntryData with a
+ * NULL firstTuple field --- that cues us to look at the inputslot instead.
+ * This convention avoids the need to materialize virtual input tuples unless
+ * they actually need to get copied into the table.
+ *
+ * Also, the caller must select an appropriate memory context for running
+ * the hash functions.
+ */
+static inline uint32
+AggTupleHashTableHash(struct aggtuplehash_hash *tb, const MinimalTuple tuple)
+{
+	TupleHashTable hashtable = (TupleHashTable) tb->private_data;
+	int			numCols = hashtable->numCols;
+	AttrNumber *keyColIdx = hashtable->keyColIdx;
+	uint32		hashkey = hashtable->hash_iv;
+	TupleTableSlot *slot;
+	FmgrInfo   *hashfunctions;
+	int			i;
+
+	if (tuple == NULL)
+	{
+		/* Process the current input tuple for the table */
+		slot = hashtable->inputslot;
+		hashfunctions = hashtable->in_hash_funcs;
+	}
+	else
+	{
+		/*
+		 * Process a tuple already stored in the table.
+		 *
+		 * (this case never actually occurs due to the way simplehash.h is
+		 * used, as the hash-value is stored in the entries)
+		 */
+		slot = hashtable->tableslot;
+		ExecStoreMinimalTuple(tuple, slot, false);
+		hashfunctions = hashtable->tab_hash_funcs;
+	}
+
+	for (i = 0; i < numCols; i++)
+	{
+		AttrNumber	att = keyColIdx[i];
+		Datum		attr;
+		bool		isNull;
+
+		/* rotate hashkey left 1 bit at each step */
+		hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
+
+		attr = slot_getattr(slot, att, &isNull);
+
+		if (!isNull)			/* treat nulls as having hash key 0 */
+		{
+			uint32		hkey;
+
+			hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i],
+												attr));
+			hashkey ^= hkey;
+		}
+	}
+
+	return hashkey;
+}
+
+/*
+ * See whether two tuples (presumably of the same hash value) match
+ *
+ * As above, the passed pointers are pointers to TupleHashEntryData.
+ *
+ * Also, the caller must select an appropriate memory context for running
+ * the compare functions.
+ */
+static inline int
+AggTupleHashTableMatch(struct aggtuplehash_hash *tb,
+					   const MinimalTuple tuple1,
+					   const MinimalTuple tuple2)
+{
+	TupleTableSlot *slot1;
+	TupleTableSlot *slot2;
+	TupleHashTable hashtable = (TupleHashTable) tb->private_data;
+
+	/*
+	 * We assume that simplehash.h will only ever call us with the first
+	 * argument being an actual table entry, and the second argument being
+	 * LookupTupleHashEntry's dummy TupleHashEntryData.  The other direction
+	 * could be supported too, but is not currently required.
+	 */
+	Assert(tuple1 != NULL);
+	slot1 = hashtable->tableslot;
+	ExecStoreMinimalTuple(tuple1, slot1, false);
+	Assert(tuple2 == NULL);
+	slot2 = hashtable->inputslot;
+
+	/* For crosstype comparisons, the inputslot must be first */
+	if (execTuplesMatch(slot2,
+						slot1,
+						hashtable->numCols,
+						hashtable->keyColIdx,
+						hashtable->cur_eq_funcs,
+						hashtable->tempcxt))
+		return 0;
+	else
+		return 1;
+}
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 386fcb4c8b..af8e98f2b4 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -17,6 +17,7 @@
 #include "catalog/partition.h"
 #include "executor/execdesc.h"
 #include "nodes/parsenodes.h"
+#include "utils/memutils.h"
 
 
 /*
@@ -121,12 +122,6 @@ extern bool execCurrentOf(CurrentOfExpr *cexpr,
 /*
  * prototypes from functions in execGrouping.c
  */
-extern bool execTuplesMatch(TupleTableSlot *slot1,
-				TupleTableSlot *slot2,
-				int numCols,
-				AttrNumber *matchColIdx,
-				FmgrInfo *eqfunctions,
-				MemoryContext evalContext);
 extern bool execTuplesUnequal(TupleTableSlot *slot1,
 				  TupleTableSlot *slot2,
 				  int numCols,
@@ -417,4 +412,95 @@ extern void ExecSimpleRelationDelete(EState *estate, EPQState *epqstate,
 extern void CheckCmdReplicaIdentity(Relation rel, CmdType cmd);
 
 
+/*
+ * Below is a static inline function moved from execGrouping.c: since we
+ * have inlined all the hashtable interface functions in nodeAgg.c, why not
+ * inline execTuplesMatch too?
+ * Obviously this is not a good place for it; it should be moved to
+ * something like execGrouping.h, and all callers updated.
+ */
+
+static inline bool execTuplesMatch(TupleTableSlot *slot1,
+								   TupleTableSlot *slot2,
+								   int numCols,
+								   AttrNumber *matchColIdx,
+								   FmgrInfo *eqfunctions,
+								   MemoryContext evalContext);
+
+/*
+ * execTuplesMatch
+ *		Return true if two tuples match in all the indicated fields.
+ *
+ * This actually implements SQL's notion of "not distinct".  Two nulls
+ * match, a null and a not-null don't match.
+ *
+ * slot1, slot2: the tuples to compare (must have same columns!)
+ * numCols: the number of attributes to be examined
+ * matchColIdx: array of attribute column numbers
+ * eqFunctions: array of fmgr lookup info for the equality functions to use
+ * evalContext: short-term memory context for executing the functions
+ *
+ * NB: evalContext is reset each time!
+ */
+static inline bool
+execTuplesMatch(TupleTableSlot *slot1,
+				TupleTableSlot *slot2,
+				int numCols,
+				AttrNumber *matchColIdx,
+				FmgrInfo *eqfunctions,
+				MemoryContext evalContext)
+{
+	MemoryContext oldContext;
+	bool		result;
+	int			i;
+
+	/* Reset and switch into the temp context. */
+	MemoryContextReset(evalContext);
+	oldContext = MemoryContextSwitchTo(evalContext);
+
+	/*
+	 * We cannot report a match without checking all the fields, but we can
+	 * report a non-match as soon as we find unequal fields.  So, start
+	 * comparing at the last field (least significant sort key). That's the
+	 * most likely to be different if we are dealing with sorted input.
+	 */
+	result = true;
+
+	for (i = numCols; --i >= 0;)
+	{
+		AttrNumber	att = matchColIdx[i];
+		Datum		attr1,
+					attr2;
+		bool		isNull1,
+					isNull2;
+
+		attr1 = slot_getattr(slot1, att, &isNull1);
+
+		attr2 = slot_getattr(slot2, att, &isNull2);
+
+		if (isNull1 != isNull2)
+		{
+			result = false;		/* one null and one not; they aren't equal */
+			break;
+		}
+
+		if (isNull1)
+			continue;			/* both are null, treat as equal */
+
+		/* Apply the type-specific equality function */
+
+		if (!DatumGetBool(FunctionCall2(&eqfunctions[i],
+										attr1, attr2)))
+		{
+			result = false;		/* they aren't equal */
+			break;
+		}
+	}
+
+	MemoryContextSwitchTo(oldContext);
+
+	return result;
+}
+
+
 #endif   /* EXECUTOR_H  */
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index d2fee52e12..0808bfc75a 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -16,8 +16,10 @@
 
 #include "nodes/execnodes.h"
 
-extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecAgg(AggState *node);
+extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags,
+							 PlanState *parent);
+extern bool ExecPushTupleToAgg(TupleTableSlot *slot, AggState *node);
+extern void ExecPushNullToAgg(TupleTableSlot *slot, AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
diff --git a/src/include/lib/simplehash.h b/src/include/lib/simplehash.h
index 6c6c3ee0d0..e865b87298 100644
--- a/src/include/lib/simplehash.h
+++ b/src/include/lib/simplehash.h
@@ -25,12 +25,33 @@
  *		declarations reside
  *    - SH_USE_NONDEFAULT_ALLOCATOR - if defined no element allocator functions
  *      are defined, so you can supply your own
+ *    - SH_FOREACH_ON -- if defined, SH_TYPE_foreach function for iterating
+ *		over the hashtable will be generated. This function works as follows:
+ *		SH_FOREACH_ACC_TYPE SH_TYPE_foreach(hashtable, void *direct_arg)
+ *		{
+ *		  accum = accum_init_val
+ *		  for each element in hashtable
+ *		    accum = accum_func(accum, foreach_func(element, direct_arg))
+ *		  return accum
+ *		}
+ *	  So, you have to specify the following macros if you use this:
+ *	  - SH_FOREACH_ACC_TYPE -- type of accum
+ *	  - some more if SH_DEFINE is defined (see below)
+ *
  *	  The following parameters are only relevant when SH_DEFINE is defined:
  *	  - SH_KEY - name of the element in SH_ELEMENT_TYPE containing the hash key
  *	  - SH_EQUAL(table, a, b) - compare two table keys
  *	  - SH_HASH_KEY(table, key) - generate hash for the key
  *	  - SH_STORE_HASH - if defined the hash is stored in the elements
  *	  - SH_GET_HASH(tb, a) - return the field to store the hash in
+ *    Macros for foreach:
+ *	  - SH_FOREACH_ACC_INIT -- initial value of accum
+ *	  - SH_FOREACH_FUNC -- name of foreach_func; its prototype is
+ *	    SH_FOREACH_ACC_TYPE foreach_func(SH_ELEMENT_TYPE *el,
+ *										 void *direct_arg)
+ *    - SH_FOREACH_ACC_FUNC -- name of accum_func; its prototype is
+ *      SH_FOREACH_ACC_TYPE accum_func(SH_FOREACH_ACC_TYPE old,
+ *									   SH_FOREACH_ACC_TYPE new)
  *
  *	  For examples of usage look at simplehash.c (file local definition) and
  *	  execnodes.h/execGrouping.c (exposed declaration, file local
@@ -75,6 +96,7 @@
 #define SH_INSERT SH_MAKE_NAME(insert)
 #define SH_DELETE SH_MAKE_NAME(delete)
 #define SH_LOOKUP SH_MAKE_NAME(lookup)
+#define SH_FOREACH SH_MAKE_NAME(foreach)
 #define SH_GROW SH_MAKE_NAME(grow)
 #define SH_START_ITERATE SH_MAKE_NAME(start_iterate)
 #define SH_START_ITERATE_AT SH_MAKE_NAME(start_iterate_at)
@@ -147,6 +169,9 @@ SH_SCOPE bool SH_DELETE(SH_TYPE *tb, SH_KEY_TYPE key);
 SH_SCOPE void SH_START_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter);
 SH_SCOPE void SH_START_ITERATE_AT(SH_TYPE *tb, SH_ITERATOR *iter, uint32 at);
 SH_SCOPE SH_ELEMENT_TYPE *SH_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter);
+#ifdef SH_FOREACH_ON
+SH_SCOPE SH_FOREACH_ACC_TYPE SH_FOREACH(SH_TYPE *tb, void *direct_arg);
+#endif	 /* SH_FOREACH_ON */
 SH_SCOPE void SH_STAT(SH_TYPE *tb);
 
 #endif   /* SH_DECLARE */
@@ -827,6 +852,35 @@ SH_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter)
 }
 
 /*
+ * Iterate over the hashtable, doing something with each value and accumulating
+ * the result.
+ */
+#ifdef SH_FOREACH_ON
+SH_SCOPE SH_FOREACH_ACC_TYPE SH_FOREACH(SH_TYPE *tb, void *direct_arg)
+{
+	uint32 cur = 0;
+	SH_FOREACH_ACC_TYPE accum = SH_FOREACH_ACC_INIT;
+	SH_FOREACH_ACC_TYPE new_accum;
+	SH_ELEMENT_TYPE *elem;
+
+	do
+	{
+		elem = &tb->data[cur];
+		if (elem->status == SH_STATUS_IN_USE)
+		{
+			new_accum = SH_FOREACH_FUNC(elem, direct_arg);
+			accum = SH_FOREACH_ACC_FUNC(accum, new_accum);
+		}
+		/* next element in forward direction */
+		cur = (cur + 1) & tb->sizemask;
+	} while (cur != 0);
+
+	return accum;
+}
+#endif	 /* SH_FOREACH_ON */
+
+
+/*
  * Report some statistics about the state of the hashtable. For
  * debugging/profiling purposes only.
  */
@@ -914,6 +968,11 @@ SH_STAT(SH_TYPE *tb)
 #undef SH_GET_HASH
 #undef SH_STORE_HASH
 #undef SH_USE_NONDEFAULT_ALLOCATOR
+#undef SH_FOREACH_ON
+#undef SH_FOREACH_ACC_TYPE
+#undef SH_FOREACH_ACC_INIT
+#undef SH_FOREACH_FUNC
+#undef SH_FOREACH_ACC_FUNC
 
 /* undefine locally declared macros */
 #undef SH_MAKE_PREFIX
@@ -942,6 +1001,7 @@ SH_STAT(SH_TYPE *tb)
 #undef SH_START_ITERATE
 #undef SH_START_ITERATE_AT
 #undef SH_ITERATE
+#undef SH_FOREACH
 #undef SH_ALLOCATE
 #undef SH_FREE
 #undef SH_STAT
-- 
2.11.0

Attachment: 0008-Reversed-in-memory-Sort-implementation.patch (text/x-diff)
From 1bf26fc1eaa6103342639e24a71e90d515c9b7a8 Mon Sep 17 00:00:00 2001
From: Arseny Sher <sher-ars@ispras.ru>
Date: Tue, 14 Mar 2017 20:03:41 +0300
Subject: [PATCH 8/8] Reversed in-memory Sort implementation.

Only in-memory sort is supported for now.
---
 src/backend/executor/execProcnode.c |  15 ++++
 src/backend/executor/nodeSort.c     | 139 +++++++++++-------------------------
 src/backend/utils/sort/tuplesort.c  |  33 +++++++++
 src/include/executor/nodeSort.h     |   6 +-
 src/include/utils/tuplesort.h       |   3 +
 5 files changed, 95 insertions(+), 101 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index ab0312a4bf..aadfce7f9d 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -193,6 +193,11 @@ ExecInitNode(Plan *node, EState *estate, int eflags, PlanState *parent)
 		/*
 		 * materialization nodes
 		 */
+		case T_Sort:
+			result = (PlanState *) ExecInitSort((Sort *) node,
+												estate, eflags, parent);
+			break;
+
 		case T_Agg:
 			result = (PlanState *) ExecInitAgg((Agg *) node,
 											   estate, eflags, parent);
@@ -288,6 +293,9 @@ ExecPushTuple(TupleTableSlot *slot, PlanState *pusher)
 	if (nodeTag(receiver) == T_LimitState)
 		return ExecPushTupleToLimit(slot, (LimitState *) receiver);
 
+	else if (nodeTag(receiver) == T_SortState)
+		return ExecPushTupleToSort(slot, (SortState *) receiver);
+
 	else if (nodeTag(receiver) == T_AggState)
 		return ExecPushTupleToAgg(slot, (AggState *) receiver);
 
@@ -334,6 +342,9 @@ ExecPushNull(TupleTableSlot *slot, PlanState *pusher)
 	if (nodeTag(receiver) == T_LimitState)
 		return ExecPushNullToLimit(slot, (LimitState *) receiver);
 
+	else if (nodeTag(receiver) == T_SortState)
+		return ExecPushNullToSort(slot, (SortState *) receiver);
+
 	else if (nodeTag(receiver) == T_AggState)
 		return ExecPushNullToAgg(slot, (AggState *) receiver);
 
@@ -410,6 +421,10 @@ ExecEndNode(PlanState *node)
 		/*
 		 * materialization nodes
 		 */
+		case T_SortState:
+			ExecEndSort((SortState *) node);
+			break;
+
 		case T_AggState:
 			ExecEndAgg((AggState *) node);
 			break;
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index 0028912509..946e0c5f84 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -22,11 +22,10 @@
 
 
 /* ----------------------------------------------------------------
- *		ExecSort
+ *		ExecPushTupleToSort
  *
- *		Sorts tuples from the outer subtree of the node using tuplesort,
- *		which saves the results in a temporary file or memory. After the
- *		initial call, returns a tuple from the file with each call.
+ *		Puts a tuple from the outer subtree of the node into the tuplesort,
+ *		which saves the results in a temporary file or in memory.
  *
  *		Conditions:
  *		  -- none.
@@ -35,110 +34,50 @@
  *		  -- the outer child is prepared to return the first tuple.
  * ----------------------------------------------------------------
  */
-TupleTableSlot *
-ExecSort(SortState *node)
+bool
+ExecPushTupleToSort(TupleTableSlot *slot, SortState *node)
 {
-	EState	   *estate;
-	ScanDirection dir;
-	Tuplesortstate *tuplesortstate;
-	TupleTableSlot *slot;
+	Sort	   *plannode = (Sort *) node->ss.ps.plan;
+	PlanState  *outerNode;
+	TupleDesc	tupDesc;
 
-	/*
-	 * get state info from node
-	 */
-	SO1_printf("ExecSort: %s\n",
-			   "entering routine");
-
-	estate = node->ss.ps.state;
-	dir = estate->es_direction;
-	tuplesortstate = (Tuplesortstate *) node->tuplesortstate;
+	/* bounded nodes not supported yet */
+	Assert(!node->bounded);
+	/* only forward direction is supported for now */
+	Assert(ScanDirectionIsForward(node->ss.ps.state->es_direction));
+	Assert(!node->sort_Done);
 
-	/*
-	 * If first time through, read all tuples from outer plan and pass them to
-	 * tuplesort.c. Subsequent calls just fetch tuples from tuplesort.
-	 */
-
-	if (!node->sort_Done)
+	if (node->tuplesortstate == NULL)
 	{
-		Sort	   *plannode = (Sort *) node->ss.ps.plan;
-		PlanState  *outerNode;
-		TupleDesc	tupDesc;
-
-		SO1_printf("ExecSort: %s\n",
-				   "sorting subplan");
-
-		/*
-		 * Want to scan subplan in the forward direction while creating the
-		 * sorted data.
-		 */
-		estate->es_direction = ForwardScanDirection;
-
-		/*
-		 * Initialize tuplesort module.
-		 */
-		SO1_printf("ExecSort: %s\n",
-				   "calling tuplesort_begin");
-
+		/* first call, time to create tuplesort */
 		outerNode = outerPlanState(node);
 		tupDesc = ExecGetResultType(outerNode);
 
-		tuplesortstate = tuplesort_begin_heap(tupDesc,
-											  plannode->numCols,
-											  plannode->sortColIdx,
-											  plannode->sortOperators,
-											  plannode->collations,
-											  plannode->nullsFirst,
-											  work_mem,
-											  node->randomAccess);
-		if (node->bounded)
-			tuplesort_set_bound(tuplesortstate, node->bound);
-		node->tuplesortstate = (void *) tuplesortstate;
-
-		/*
-		 * Scan the subplan and feed all the tuples to tuplesort.
-		 */
-
-		for (;;)
-		{
-			slot = ExecProcNode(outerNode);
-
-			if (TupIsNull(slot))
-				break;
-
-			tuplesort_puttupleslot(tuplesortstate, slot);
-		}
-
-		/*
-		 * Complete the sort.
-		 */
-		tuplesort_performsort(tuplesortstate);
-
-		/*
-		 * restore to user specified direction
-		 */
-		estate->es_direction = dir;
-
-		/*
-		 * finally set the sorted flag to true
-		 */
-		node->sort_Done = true;
-		node->bounded_Done = node->bounded;
-		node->bound_Done = node->bound;
-		SO1_printf("ExecSort: %s\n", "sorting done");
+		node->tuplesortstate = tuplesort_begin_heap(tupDesc,
+													plannode->numCols,
+													plannode->sortColIdx,
+													plannode->sortOperators,
+													plannode->collations,
+													plannode->nullsFirst,
+													work_mem,
+													node->randomAccess);
 	}
+	/* feed the tuple to tuplesort */
+	tuplesort_puttupleslot(node->tuplesortstate, slot);
+	return true;
+}
 
-	SO1_printf("ExecSort: %s\n",
-			   "retrieving tuple from tuplesort");
-
+/* NULL tuple arrived, sort and push tuples */
+void
+ExecPushNullToSort(TupleTableSlot *slot, SortState *node)
+{
 	/*
-	 * Get the first or next tuple from tuplesort. Returns NULL if no more
-	 * tuples.
+	 * Complete the sort.
 	 */
-	slot = node->ss.ps.ps_ResultTupleSlot;
-	(void) tuplesort_gettupleslot(tuplesortstate,
-								  ScanDirectionIsForward(dir),
-								  slot, NULL);
-	return slot;
+	tuplesort_performsort(node->tuplesortstate);
+	node->sort_Done = true;
+
+	tuplesort_pushtuples(node->tuplesortstate, node);
 }
 
 /* ----------------------------------------------------------------
@@ -149,7 +88,7 @@ ExecSort(SortState *node)
  * ----------------------------------------------------------------
  */
 SortState *
-ExecInitSort(Sort *node, EState *estate, int eflags)
+ExecInitSort(Sort *node, EState *estate, int eflags, PlanState *parent)
 {
 	SortState  *sortstate;
 
@@ -162,6 +101,7 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	sortstate = makeNode(SortState);
 	sortstate->ss.ps.plan = (Plan *) node;
 	sortstate->ss.ps.state = estate;
+	sortstate->ss.ps.parent = parent;
 
 	/*
 	 * We must have random access to the sort output to do backward scan or
@@ -199,7 +139,8 @@ ExecInitSort(Sort *node, EState *estate, int eflags)
 	 */
 	eflags &= ~(EXEC_FLAG_REWIND | EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK);
 
-	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags, NULL);
+	outerPlanState(sortstate) = ExecInitNode(outerPlan(node), estate, eflags,
+											 (PlanState *) sortstate);
 
 	/*
 	 * initialize tuple type.  no need to initialize projection info because
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index e1e692d5f0..7e078c22a3 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2080,6 +2080,39 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 }
 
 /*
+ * Push every tuple from the tuplesort to the parent, then push a null tuple.
+ */
+void
+tuplesort_pushtuples(Tuplesortstate *state, SortState *node)
+{
+	SortTuple	stup;
+	MemoryContext oldcontext = CurrentMemoryContext;
+	TupleTableSlot *slot = node->ss.ps.ps_ResultTupleSlot;
+
+	/* only in-memory sort is supported for now */
+	Assert(state->status == TSS_SORTEDINMEM);
+	Assert(!state->slabAllocatorUsed);
+
+	while (state->current < state->memtupcount)
+	{
+		/* Imitating context switching as it was before */
+		MemoryContextSwitchTo(state->sortcontext);
+		stup = state->memtuples[state->current++];
+		MemoryContextSwitchTo(oldcontext);
+
+		stup.tuple = heap_copy_minimal_tuple((MinimalTuple) stup.tuple);
+		ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, true);
+		if (!ExecPushTuple(slot, (PlanState *) node))
+			return;
+	}
+
+	/* If the parent still waits for tuples, let it know we are done */
+	state->eof_reached = true;
+	ExecClearTuple(slot);
+	(void) ExecPushNull(slot, (PlanState *) node);
+}
+
+/*
  * Fetch the next tuple in either forward or back direction.
  * If successful, put tuple in slot and return TRUE; else, clear the slot
  * and return FALSE.
diff --git a/src/include/executor/nodeSort.h b/src/include/executor/nodeSort.h
index 10d16b47b1..3a726b3500 100644
--- a/src/include/executor/nodeSort.h
+++ b/src/include/executor/nodeSort.h
@@ -16,8 +16,10 @@
 
 #include "nodes/execnodes.h"
 
-extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags);
-extern TupleTableSlot *ExecSort(SortState *node);
+extern SortState *ExecInitSort(Sort *node, EState *estate, int eflags,
+							   PlanState *parent);
+extern bool ExecPushTupleToSort(TupleTableSlot *slot, SortState *node);
+extern void ExecPushNullToSort(TupleTableSlot *slot, SortState *node);
 extern void ExecEndSort(SortState *node);
 extern void ExecSortMarkPos(SortState *node);
 extern void ExecSortRestrPos(SortState *node);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 5b3f4752f4..45c6659bea 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -92,6 +92,9 @@ extern void tuplesort_putdatum(Tuplesortstate *state, Datum val,
 
 extern void tuplesort_performsort(Tuplesortstate *state);
 
+/* forward decl, since now we need to know about SortState */
+typedef struct SortState SortState;
+extern void tuplesort_pushtuples(Tuplesortstate *state, SortState *node);
 extern bool tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
 					   TupleTableSlot *slot, Datum *abbrev);
 extern HeapTuple tuplesort_getheaptuple(Tuplesortstate *state, bool forward);
-- 
2.11.0

#8Arseny Sher
sher-ars@ispras.ru
In reply to: Arseny Sher (#7)
Re: [GSoC] Push-based query executor discussion

Time is short, student's application deadline is on 3rd April. I decided
to reformulate the project scope myself. Here is the proposal:

https://docs.google.com/document/d/1dvBETE6IJA9AcXd11XJNPsF_VPcDhSjy7rlsxj262l8/edit?usp=sharing

The main idea is that now there is a formalized goal of the project,
"partial support of all TPC-H queries".

I am also CC'ing the people who were mentioned in the "Potential Mentors"
section on the GSoC wiki page.

--
Arseny Sher

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Alexander Korotkov
a.korotkov@postgrespro.ru
In reply to: Arseny Sher (#8)
Re: [GSoC] Push-based query executor discussion

On Sun, Apr 2, 2017 at 12:13 AM, Arseny Sher <sher-ars@ispras.ru> wrote:

Time is short, student's application deadline is on 3rd April. I decided
to reformulate the project scope myself. Here is the proposal:

https://docs.google.com/document/d/1dvBETE6IJA9AcXd11XJNPsF_VPcDhSjy7rlsxj262l8/edit?usp=sharing

The main idea is that now there is a formalized goal of the project,
"partial support of all TPC-H queries".

I am also CC'ing the people who were mentioned in the "Potential Mentors"
section on the GSoC wiki page.

I'd love to see a comment from Andres Freund who is leading executor
performance improvements.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#10Kevin Grittner
kgrittn@gmail.com
In reply to: Alexander Korotkov (#9)
Re: [HACKERS] [GSoC] Push-based query executor discussion

On Thu, Apr 6, 2017 at 8:11 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:

https://docs.google.com/document/d/1dvBETE6IJA9AcXd11XJNPsF_VPcDhSjy7rlsxj262l8/edit?usp=sharing

I'd love to see a comment from Andres Freund who is leading executor
performance improvements.

Note that the final proposal is here:

https://summerofcode.withgoogle.com/serve/5874530240167936/

Also, I just entered a comment about an important question that I
think needs to be answered right up front.

--
Kevin Grittner

--
Sent via pgsql-students mailing list (pgsql-students@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-students

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kevin Grittner (#10)
Re: [HACKERS] [GSoC] Push-based query executor discussion

Kevin Grittner <kgrittn@gmail.com> writes:

Note that the final proposal is here:
https://summerofcode.withgoogle.com/serve/5874530240167936/

I'm just getting a blank page at that URL?

regards, tom lane


#12Kevin Grittner
kgrittn@gmail.com
In reply to: Tom Lane (#11)
Re: [HACKERS] [GSoC] Push-based query executor discussion

Sorry, I didn't notice that this was going to a public list. That URL
is only available to people who signed up as mentors for PostgreSQL
GSoC participation this year. Does the link to the draft work for you?

--
Kevin Grittner


#13Simon Riggs
simon@2ndquadrant.com
In reply to: Oleg Bartunov (#5)
Re: [HACKERS] [GSoC] Push-based query executor discussion

On 22 March 2017 at 14:58, Oleg Bartunov <obartunov@gmail.com> wrote:

Should we reject this interesting project, which is based on several years of
research work by an academic group at the institute? Maybe it would be better
to help him reformulate the scope of the project and let him work? I don't
know exactly whether the results of a GSoC project should be committed, but as
a research project it would certainly be useful for the community.

+1

Arseny, thank you for your contributions.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
