Parallel grouping sets
Hi all,
Paul and I have been hacking recently to implement parallel grouping
sets, and here we have two implementations.
Implementation 1
================
Attached is the patch, and there is also a GitHub branch [1] for this
work.
Parallel aggregation has already been supported in PostgreSQL and it is
implemented by aggregating in two stages. First, each worker performs an
aggregation step, producing a partial result for each group of which
that process is aware. Second, the partial results are transferred to
the leader via the Gather node. Finally, the leader merges the partial
results and produces the final result for each group.
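The two-stage scheme can be sketched with avg() as the aggregate. This is an illustrative model, not PostgreSQL executor code; the function names and data are made up:

```python
# Hypothetical model of two-stage parallel aggregation for avg(c3).
# Each worker keeps a transition state (sum, count) per group; the
# leader combines the partial states and finalizes them.

def partial_avg(rows):
    """Worker side: map group key -> (sum, count) transition state."""
    states = {}
    for key, value in rows:
        s, c = states.get(key, (0, 0))
        states[key] = (s + value, c + 1)
    return states

def final_avg(partials):
    """Leader side: merge partial states, then finalize to an average."""
    merged = {}
    for states in partials:
        for key, (s, c) in states.items():
            ms, mc = merged.get(key, (0, 0))
            merged[key] = (ms + s, mc + c)
    return {key: s / c for key, (s, c) in merged.items()}

# Two workers each see part of the table.
w1 = partial_avg([(("a",), 1), (("a",), 3)])
w2 = partial_avg([(("a",), 5), (("b",), 7)])
print(final_avg([w1, w2]))  # {('a',): 3.0, ('b',): 7.0}
```

The key point is that the transition state (sum, count), not the finished average, crosses the Gather node, so partial results from different workers can still be merged correctly.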
We are implementing parallel grouping sets in the same way. The only
difference is that in the final stage, the leader performs a grouping
sets aggregation, rather than a normal aggregation.
The plan looks like:
# explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
grouping sets((c1,c2), (c1), (c2,c3));
QUERY PLAN
---------------------------------------------------------
Finalize MixedAggregate
Output: c1, c2, avg(c3), c3
Hash Key: t2.c2, t2.c3
Group Key: t2.c1, t2.c2
Group Key: t2.c1
-> Gather Merge
Output: c1, c2, c3, (PARTIAL avg(c3))
Workers Planned: 2
-> Sort
Output: c1, c2, c3, (PARTIAL avg(c3))
Sort Key: t2.c1, t2.c2
-> Partial HashAggregate
Output: c1, c2, c3, PARTIAL avg(c3)
Group Key: t2.c1, t2.c2, t2.c3
-> Parallel Seq Scan on public.t2
Output: c1, c2, c3
(16 rows)
As the partial aggregation can be performed in parallel, we can expect a
speedup if the number of groups seen by the Finalize Aggregate node is
significantly less than the number of input rows.
For example, for the table provided in the test case within the patch,
running the above query on my Linux box gives:
# explain analyze select c1, c2, avg(c3) from t2 group by grouping
sets((c1,c2), (c1), (c2,c3)); -- without patch
Planning Time: 0.123 ms
Execution Time: 9459.362 ms
# explain analyze select c1, c2, avg(c3) from t2 group by grouping
sets((c1,c2), (c1), (c2,c3)); -- with patch
Planning Time: 0.204 ms
Execution Time: 1077.654 ms
But sometimes we may not benefit from this patch. In the worst case, the
number of groups seen by the Finalize Aggregate node can be as large as
the number of input rows seen by all the worker processes in the Partial
Aggregate stage. This patch is prone to that, because the group key for
the Partial Aggregate is the union of all the columns involved in the
grouping sets; in the above query, that is (c1, c2, c3).
So, we have been working on another way to implement parallel grouping
sets.
Implementation 2
================
This work can be found in the GitHub branch [2]. As it contains some
hacky code and a list of TODO items, it is far from a patch, so please
consider it a PoC.
The idea is that instead of performing the grouping sets aggregation in
the Finalize Aggregate node, we perform it in the Partial Aggregate node.
The plan looks like:
# explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
grouping sets((c1,c2), (c1));
QUERY PLAN
--------------------------------------------------------------
Finalize GroupAggregate
Output: c1, c2, avg(c3), (gset_id)
Group Key: t2.c1, t2.c2, (gset_id)
-> Gather Merge
Output: c1, c2, (gset_id), (PARTIAL avg(c3))
Workers Planned: 2
-> Sort
Output: c1, c2, (gset_id), (PARTIAL avg(c3))
Sort Key: t2.c1, t2.c2, (gset_id)
-> Partial HashAggregate
Output: c1, c2, gset_id, PARTIAL avg(c3)
Hash Key: t2.c1, t2.c2
Hash Key: t2.c1
-> Parallel Seq Scan on public.t2
Output: c1, c2, c3
(15 rows)
With this method there is a problem: in the final stage of aggregation,
the leader has no way to tell which tuple comes from which grouping set,
and it needs that information to merge the partial results correctly.
For instance, suppose we have a table t(c1, c2, c3) containing one row
(1, NULL, 3), and we are selecting agg(c3) group by grouping sets
((c1,c2), (c1)). Then the leader would receive two tuples for that row
via the Gather node, both (1, NULL, agg(3)): one from group by (c1,c2)
and one from group by (c1). If the leader cannot tell that the two
tuples come from different grouping sets, it will merge them
incorrectly.
So we add a hidden column 'gset_id', representing the grouping set id,
to the targetlist of the Partial Aggregate node, as well as to the group
key of the Finalize Aggregate node, so that only tuples coming from the
same grouping set are merged in the final stage of aggregation.
With this method, for grouping sets containing multiple rollups, we
generate a separate aggregation path for each rollup and then append
the paths to form the final path, to simplify the implementation.
References:
[1]: https://github.com/greenplum-db/postgres/tree/parallel_groupingsets
[2]: https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_2
Any comments and feedback are welcome.
Thanks
Richard
Attachments:
v1-0001-Implementing-parallel-grouping-sets.patch (application/octet-stream)
From 3f8b6f9ec4f853c1870a6b91d81829381937470d Mon Sep 17 00:00:00 2001
From: Richard Guo <riguo@pivotal.io>
Date: Tue, 11 Jun 2019 07:48:29 +0000
Subject: [PATCH] Implementing parallel grouping sets.
Parallel aggregation has already been supported in PostgreSQL and it is
implemented by aggregating in two stages. First, each worker performs an
aggregation step, producing a partial result for each group of which
that process is aware. Second, the partial results are transferred to
the leader via the Gather node. Finally, the leader merges the partial
results and produces the final result for each group.
We are implementing parallel grouping sets in the same way. The only
difference is that in the final stage, the leader performs a grouping
sets aggregation, rather than a normal aggregation.
Co-authored-by: Richard Guo <riguo@pivotal.io>
Co-authored-by: Paul Guo <pguo@pivotal.io>
---
src/backend/optimizer/plan/createplan.c | 4 +-
src/backend/optimizer/plan/planner.c | 59 ++++---
src/backend/optimizer/util/pathnode.c | 2 +
src/include/nodes/pathnodes.h | 1 +
src/include/optimizer/pathnode.h | 1 +
src/test/regress/expected/parallelgroupingsets.out | 178 +++++++++++++++++++++
src/test/regress/sql/parallelgroupingsets.sql | 43 +++++
7 files changed, 265 insertions(+), 23 deletions(-)
create mode 100644 src/test/regress/expected/parallelgroupingsets.out
create mode 100644 src/test/regress/sql/parallelgroupingsets.sql
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5ad..6e9dfa5 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2245,7 +2245,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
@@ -2283,7 +2283,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc..f6566f9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -176,7 +176,8 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ AggSplit aggsplit);
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -4183,7 +4184,8 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ AggSplit aggsplit)
{
Query *parse = root->parse;
@@ -4345,6 +4347,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
strat,
+ aggsplit,
new_rollups,
agg_costs,
dNumGroups));
@@ -4502,6 +4505,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_MIXED,
+ aggsplit,
rollups,
agg_costs,
dNumGroups));
@@ -4518,6 +4522,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_SORTED,
+ aggsplit,
gd->rollups,
agg_costs,
dNumGroups));
@@ -6406,7 +6411,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
consider_groupingsets_paths(root, grouped_rel,
path, true, can_hash,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else if (parse->hasAggs)
{
@@ -6473,7 +6478,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
-1.0);
}
- if (parse->hasAggs)
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, true, can_hash,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else if (parse->hasAggs)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6508,7 +6520,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else
{
@@ -6556,17 +6568,27 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
dNumGroups);
if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ {
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6952,11 +6974,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2b..b5d79d2 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3014,6 +3014,7 @@ create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups)
@@ -3059,6 +3060,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->path.pathkeys = NIL;
pathnode->aggstrategy = aggstrategy;
+ pathnode->aggsplit = aggsplit;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d..739f279 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1693,6 +1693,7 @@ typedef struct GroupingSetsPath
Path path;
Path *subpath; /* path representing input source */
AggStrategy aggstrategy; /* basic strategy */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
} GroupingSetsPath;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3..9d912fd 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,6 +217,7 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups);
diff --git a/src/test/regress/expected/parallelgroupingsets.out b/src/test/regress/expected/parallelgroupingsets.out
new file mode 100644
index 0000000..52761e7
--- /dev/null
+++ b/src/test/regress/expected/parallelgroupingsets.out
@@ -0,0 +1,178 @@
+--
+-- grouping sets
+--
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int);
+insert into gstest select 1,10,100 from generate_series(1,1000000)i;
+insert into gstest select 1,10,200 from generate_series(1,1000000)i;
+insert into gstest select 1,20,30 from generate_series(1,1000000)i;
+insert into gstest select 2,30,40 from generate_series(1,1000000)i;
+insert into gstest select 2,40,50 from generate_series(1,1000000)i;
+insert into gstest select 3,50,60 from generate_series(1,1000000)i;
+insert into gstest select 1,NULL,000000 from generate_series(1,1000000)i;
+analyze gstest;
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+ QUERY PLAN
+------------------------------------------------------------
+ Finalize GroupAggregate
+ Output: c1, c2, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ -> Gather Merge
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Sort
+ Output: c1, c2, (PARTIAL avg(c3))
+ Sort Key: gstest.c1, gstest.c2
+ -> Partial HashAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(15 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+----------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.000000000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3));
+ QUERY PLAN
+----------------------------------------------------------------
+ Finalize MixedAggregate
+ Output: c1, c2, c3, avg(c3)
+ Hash Key: gstest.c2, gstest.c3
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ -> Gather Merge
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Sort
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Sort Key: gstest.c1, gstest.c2
+ -> Partial HashAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(16 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+----------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.000000000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.000000000000000000000000
+(16 rows)
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+ QUERY PLAN
+------------------------------------------------------------
+ Finalize GroupAggregate
+ Output: c1, c2, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ -> Gather Merge
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(15 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+----------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.000000000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3));
+ QUERY PLAN
+------------------------------------------------------------
+ Finalize GroupAggregate
+ Output: c1, c2, c3, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ Sort Key: gstest.c2, gstest.c3
+ Group Key: gstest.c2, gstest.c3
+ -> Gather Merge
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(17 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+----------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.000000000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.000000000000000000000000
+(16 rows)
+
+drop table gstest;
diff --git a/src/test/regress/sql/parallelgroupingsets.sql b/src/test/regress/sql/parallelgroupingsets.sql
new file mode 100644
index 0000000..24cdb3b
--- /dev/null
+++ b/src/test/regress/sql/parallelgroupingsets.sql
@@ -0,0 +1,43 @@
+--
+-- grouping sets
+--
+
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int);
+
+insert into gstest select 1,10,100 from generate_series(1,1000000)i;
+insert into gstest select 1,10,200 from generate_series(1,1000000)i;
+insert into gstest select 1,20,30 from generate_series(1,1000000)i;
+insert into gstest select 2,30,40 from generate_series(1,1000000)i;
+insert into gstest select 2,40,50 from generate_series(1,1000000)i;
+insert into gstest select 3,50,60 from generate_series(1,1000000)i;
+insert into gstest select 1,NULL,000000 from generate_series(1,1000000)i;
+analyze gstest;
+
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3));
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3)) order by 1,2,3,4;
+
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3));
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3)) order by 1,2,3,4;
+
+
+drop table gstest;
--
2.7.4
On Wed, 12 Jun 2019 at 14:59, Richard Guo <riguo@pivotal.io> wrote:
Implementation 1
Parallel aggregation has already been supported in PostgreSQL and it is
implemented by aggregating in two stages. First, each worker performs an
aggregation step, producing a partial result for each group of which
that process is aware. Second, the partial results are transferred to
the leader via the Gather node. Finally, the leader merges the partial
results and produces the final result for each group.

We are implementing parallel grouping sets in the same way. The only
difference is that in the final stage, the leader performs a grouping
sets aggregation, rather than a normal aggregation.
Hi Richard,
I think it was you and I that discussed #1 at the unconference at PGCon 2
weeks ago. The good thing about #1 is that it can be implemented as
planner-only changes just by adding some additional paths and some
costing. #2 will be useful when we're unable to reduce the number of
inputs to the final aggregate node by doing the initial grouping.
However, since #1 is easier, I'd suggest going with it first,
since it's the path of least resistance. #1 should be fine as long as
you properly cost the parallel agg and don't choose it when the number
of groups going into the final agg isn't reduced by the partial agg
node. Which brings me to:
You'll need to do further work with the dNumGroups value. Since you're
grouping by all the columns/exprs in the grouping sets you'll need the
number of groups to be an estimate of that.
Here's a quick test I did that shows the problem:
create table abc(a int, b int, c int);
insert into abc select a,b,1 from generate_Series(1,1000)
a,generate_Series(1,1000) b;
create statistics abc_a_b_stats (ndistinct) on a,b from abc;
analyze abc;
-- Here the Partial HashAggregate really should estimate that there
will be 1 million rows.
explain analyze select a,b,sum(c) from abc group by grouping sets ((a),(b));
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=14137.67..14177.67 rows=2000 width=16)
(actual time=1482.746..1483.203 rows=2000 loops=1)
Hash Key: a
Hash Key: b
-> Gather (cost=13697.67..14117.67 rows=4000 width=16) (actual
time=442.140..765.931 rows=1000000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=12697.67..12717.67 rows=2000
width=16) (actual time=402.917..526.045 rows=333333 loops=3)
Group Key: a, b
-> Parallel Seq Scan on abc (cost=0.00..9572.67
rows=416667 width=12) (actual time=0.036..50.275 rows=333333 loops=3)
Planning Time: 0.140 ms
Execution Time: 1489.734 ms
(11 rows)
but really, likely the parallel plan should not be chosen in this case
since we're not really reducing the number of groups going into the
finalize aggregate node. That'll need to be factored into the costing
so that we don't choose the parallel plan when we're not going to
reduce the work in the finalize aggregate node. I'm unsure exactly how
that'll look. Logically, I think the choice of whether or not to
parallelize needs to be: if (cost_partial_agg + cost_gather +
cost_final_agg < cost_agg) { do it in parallel } else { do it in
serial }. If you build both a serial and parallel set of paths then
you should see which one is cheaper without actually constructing an
"if" test like the one above.
Here's a simple group by with the same group by clause items as you
have in the plan above that does get the estimated number of groups
perfectly. The plan above should have the same estimate.
explain analyze select a,b,sum(c) from abc group by a,b;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=132154.34..152154.34 rows=1000000 width=16)
(actual time=404.304..1383.343 rows=1000000 loops=1)
Group Key: a, b
-> Sort (cost=132154.34..134654.34 rows=1000000 width=12) (actual
time=404.291..620.774 rows=1000000 loops=1)
Sort Key: a, b
Sort Method: external merge Disk: 21584kB
-> Seq Scan on abc (cost=0.00..15406.00 rows=1000000
width=12) (actual time=0.017..100.299 rows=1000000 loops=1)
Planning Time: 0.115 ms
Execution Time: 1412.034 ms
(8 rows)
Also, in the tests:
insert into gstest select 1,10,100 from generate_series(1,1000000)i;
insert into gstest select 1,10,200 from generate_series(1,1000000)i;
insert into gstest select 1,20,30 from generate_series(1,1000000)i;
insert into gstest select 2,30,40 from generate_series(1,1000000)i;
insert into gstest select 2,40,50 from generate_series(1,1000000)i;
insert into gstest select 3,50,60 from generate_series(1,1000000)i;
insert into gstest select 1,NULL,000000 from generate_series(1,1000000)i;
analyze gstest;
You'll likely want to reduce the number of rows being used just to
stop the regression tests becoming slow on older machines. I think
some of the other parallel aggregate tests use much fewer rows than
what you're using there. You might be able to use the standard set of
regression test tables too, tenk, tenk1 etc. That'll save the test
having to build and populate one of its own.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 13, 2019 at 12:29 PM David Rowley <david.rowley@2ndquadrant.com>
wrote:
On Wed, 12 Jun 2019 at 14:59, Richard Guo <riguo@pivotal.io> wrote:
Implementation 1
Parallel aggregation has already been supported in PostgreSQL and it is
implemented by aggregating in two stages. First, each worker performs an
aggregation step, producing a partial result for each group of which
that process is aware. Second, the partial results are transferred to
the leader via the Gather node. Finally, the leader merges the partial
results and produces the final result for each group.

We are implementing parallel grouping sets in the same way. The only
difference is that in the final stage, the leader performs a grouping
sets aggregation, rather than a normal aggregation.

Hi Richard,
I think it was you an I that discussed #1 at unconference at PGCon 2
weeks ago. The good thing about #1 is that it can be implemented as
planner-only changes just by adding some additional paths and some
costing. #2 will be useful when we're unable to reduce the number of
inputs to the final aggregate node by doing the initial grouping.
However, since #1 is easier, then I'd suggest going with it first,
since it's the path of least resistance. #1 should be fine as long as
you properly cost the parallel agg and don't choose it when the number
of groups going into the final agg isn't reduced by the partial agg
node. Which brings me to:
Hi David,
Yes. Thank you for the discussion at PGCon. I learned a lot from that.
And glad to meet you here. :)
I agree with you on going with #1 first.
You'll need to do further work with the dNumGroups value. Since you're
grouping by all the columns/exprs in the grouping sets you'll need the
number of groups to be an estimate of that.
Exactly. The v1 patch estimates the number of partial groups
incorrectly, as it calculates the number of groups for each grouping set
and then adds them up for dNumPartialPartialGroups, while it should
actually calculate the number of groups over all the columns in the
grouping sets. I have fixed this issue in the v2 patch.
Here's a quick test I did that shows the problem:
create table abc(a int, b int, c int);
insert into abc select a,b,1 from generate_Series(1,1000)
a,generate_Series(1,1000) b;
create statistics abc_a_b_stats (ndistinct) on a,b from abc;
analyze abc;

-- Here the Partial HashAggregate really should estimate that there
will be 1 million rows.
explain analyze select a,b,sum(c) from abc group by grouping sets
((a),(b));
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=14137.67..14177.67 rows=2000 width=16)
(actual time=1482.746..1483.203 rows=2000 loops=1)
Hash Key: a
Hash Key: b
-> Gather (cost=13697.67..14117.67 rows=4000 width=16) (actual
time=442.140..765.931 rows=1000000 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=12697.67..12717.67 rows=2000
width=16) (actual time=402.917..526.045 rows=333333 loops=3)
Group Key: a, b
-> Parallel Seq Scan on abc (cost=0.00..9572.67
rows=416667 width=12) (actual time=0.036..50.275 rows=333333 loops=3)
Planning Time: 0.140 ms
Execution Time: 1489.734 ms
(11 rows)

but really, likely the parallel plan should not be chosen in this case
since we're not really reducing the number of groups going into the
finalize aggregate node. That'll need to be factored into the costing
so that we don't choose the parallel plan when we're not going to
reduce the work in the finalize aggregate node. I'm unsure exactly how
that'll look. Logically, I think the choice parallelize or not to
parallelize needs to be if (cost_partial_agg + cost_gather +
cost_final_agg < cost_agg) { do it in parallel } else { do it in
serial }. If you build both a serial and parallel set of paths then
you should see which one is cheaper without actually constructing an
"if" test like the one above.
Both the serial and parallel sets of paths would be built and the
cheaper one will be selected, so we don't need the 'if' test.
With v2 patch, the parallel plan will not be chosen for the above query:
# explain analyze select a,b,sum(c) from abc group by grouping sets
((a),(b));
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=20406.00..25426.00 rows=2000 width=16) (actual
time=935.048..935.697 rows=2000 loops=1)
Hash Key: a
Hash Key: b
-> Seq Scan on abc (cost=0.00..15406.00 rows=1000000 width=12) (actual
time=0.041..170.906 rows=1000000 loops=1)
Planning Time: 0.240 ms
Execution Time: 935.978 ms
(6 rows)
Here's a simple group by with the same group by clause items as you
have in the plan above that does get the estimated number of groups
perfectly. The plan above should have the same estimate.

explain analyze select a,b,sum(c) from abc group by a,b;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=132154.34..152154.34 rows=1000000 width=16)
(actual time=404.304..1383.343 rows=1000000 loops=1)
Group Key: a, b
-> Sort (cost=132154.34..134654.34 rows=1000000 width=12) (actual
time=404.291..620.774 rows=1000000 loops=1)
Sort Key: a, b
Sort Method: external merge Disk: 21584kB
-> Seq Scan on abc (cost=0.00..15406.00 rows=1000000
width=12) (actual time=0.017..100.299 rows=1000000 loops=1)
Planning Time: 0.115 ms
Execution Time: 1412.034 ms
(8 rows)

Also, in the tests:
insert into gstest select 1,10,100 from generate_series(1,1000000)i;
insert into gstest select 1,10,200 from generate_series(1,1000000)i;
insert into gstest select 1,20,30 from generate_series(1,1000000)i;
insert into gstest select 2,30,40 from generate_series(1,1000000)i;
insert into gstest select 2,40,50 from generate_series(1,1000000)i;
insert into gstest select 3,50,60 from generate_series(1,1000000)i;
insert into gstest select 1,NULL,000000 from generate_series(1,1000000)i;
analyze gstest;

You'll likely want to reduce the number of rows being used just to
stop the regression tests becoming slow on older machines. I think
some of the other parallel aggregate tests use must fewer rows than
what you're using there. You might be able to use the standard set of
regression test tables too, tenk, tenk1 etc. That'll save the test
having to build and populate one of its own.
Yes, that makes sense. Table size has been reduced in v2 patch.
Currently I do not use the standard regression test tables, as I'd like
to customize the table with some specific data for correctness
verification. But we may switch to the standard test tables later.
Also in the v2 patch, I've fixed two additional issues. One is about the sort
key for sort-based grouping sets in Partial Aggregate, which should be
all the columns in parse->groupClause. The other one is about
GroupingFunc. Since Partial Aggregate will not handle multiple grouping
sets at once, it does not need to evaluate GroupingFunc. So GroupingFunc
is removed from the targetlists of Partial Aggregate.
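For readers following along, the Partial/Finalize split that this work builds on can be sketched in a few lines of Python. This is a toy model of how an aggregate like avg() carries a (sum, count) transition state through a partial stage and a finalize stage; the names and data are made up and do not reflect the patch's internals:

```python
# Toy model of two-stage (Partial/Finalize) aggregation for avg():
# each "worker" keeps a (sum, count) state per group, and the "leader"
# combines the per-worker states before computing the final average.
from collections import defaultdict

def partial_avg(rows):
    """Worker side: accumulate (sum, count) per group key."""
    states = defaultdict(lambda: [0, 0])
    for key, value in rows:
        st = states[key]
        st[0] += value   # running sum
        st[1] += 1       # running count
    return states

def finalize_avg(partials):
    """Leader side: merge the partial states, then compute sum/count."""
    merged = defaultdict(lambda: [0, 0])
    for states in partials:
        for key, (s, c) in states.items():
            merged[key][0] += s
            merged[key][1] += c
    return {key: s / c for key, (s, c) in merged.items()}

# Two workers each see a slice of the table.
w1 = partial_avg([(1, 100), (1, 200), (2, 40)])
w2 = partial_avg([(1, 30), (2, 50)])
result = finalize_avg([w1, w2])
assert result == {1: 110.0, 2: 45.0}  # matches avg() over all rows per group
```

The point is that the leader can finalize correctly no matter how the input rows were distributed across workers, which is what makes the per-grouping-set finalize step in the patch possible.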
Thanks
Richard
Attachments:
v2-0001-Implementing-parallel-grouping-sets.patch
From 5dc64aad99976bcdead74f1dd2376073e32898f0 Mon Sep 17 00:00:00 2001
From: Richard Guo <riguo@pivotal.io>
Date: Tue, 11 Jun 2019 07:48:29 +0000
Subject: [PATCH] Implementing parallel grouping sets.
Parallel aggregation has already been supported in PostgreSQL and it is
implemented by aggregating in two stages. First, each worker performs an
aggregation step, producing a partial result for each group of which
that process is aware. Second, the partial results are transferred to
the leader via the Gather node. Finally, the leader merges the partial
results and produces the final result for each group.
We are implementing parallel grouping sets in the same way. The only
difference is that in the final stage, the leader performs a grouping
sets aggregation, rather than a normal aggregation.
---
src/backend/optimizer/plan/createplan.c | 4 +-
src/backend/optimizer/plan/planner.c | 129 ++++++++++++----
src/backend/optimizer/util/pathnode.c | 2 +
src/include/nodes/pathnodes.h | 1 +
src/include/optimizer/pathnode.h | 1 +
src/test/regress/expected/parallelgroupingsets.out | 172 +++++++++++++++++++++
src/test/regress/sql/parallelgroupingsets.sql | 43 ++++++
7 files changed, 317 insertions(+), 35 deletions(-)
create mode 100644 src/test/regress/expected/parallelgroupingsets.out
create mode 100644 src/test/regress/sql/parallelgroupingsets.sql
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 608d5ad..6e9dfa5 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2245,7 +2245,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
@@ -2283,7 +2283,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cb897cc..2b6dd36 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -148,7 +148,8 @@ static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
grouping_sets_data *gd,
- List *target_list);
+ List *target_list,
+ bool is_partial);
static RelOptInfo *create_grouping_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *target,
@@ -176,7 +177,8 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ AggSplit aggsplit);
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -3664,6 +3666,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
* path_rows: number of output rows from scan/join step
* gd: grouping sets data including list of grouping sets and their clauses
* target_list: target list containing group clause references
+ * is_partial: whether the grouping is in partial aggregate
*
* If doing grouping sets, we also annotate the gsets data with the estimates
* for each set and each individual rollup list, with a view to later
@@ -3673,7 +3676,8 @@ static double
get_number_of_groups(PlannerInfo *root,
double path_rows,
grouping_sets_data *gd,
- List *target_list)
+ List *target_list,
+ bool is_partial)
{
Query *parse = root->parse;
double dNumGroups;
@@ -3682,7 +3686,7 @@ get_number_of_groups(PlannerInfo *root,
{
List *groupExprs;
- if (parse->groupingSets)
+ if (parse->groupingSets && !is_partial)
{
/* Add up the estimates for each grouping set */
ListCell *lc;
@@ -3745,7 +3749,7 @@ get_number_of_groups(PlannerInfo *root,
}
else
{
- /* Plain GROUP BY */
+ /* Plain GROUP BY, or grouping is in partial aggregate */
groupExprs = get_sortgrouplist_exprs(parse->groupClause,
target_list);
@@ -4138,7 +4142,8 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
dNumGroups = get_number_of_groups(root,
cheapest_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ false);
/* Build final grouping paths */
add_paths_to_grouping_rel(root, input_rel, grouped_rel,
@@ -4183,7 +4188,8 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ AggSplit aggsplit)
{
Query *parse = root->parse;
@@ -4345,6 +4351,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
strat,
+ aggsplit,
new_rollups,
agg_costs,
dNumGroups));
@@ -4502,6 +4509,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_MIXED,
+ aggsplit,
rollups,
agg_costs,
dNumGroups));
@@ -4518,6 +4526,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_SORTED,
+ aggsplit,
gd->rollups,
agg_costs,
dNumGroups));
@@ -5192,7 +5201,15 @@ make_partial_grouping_target(PlannerInfo *root,
foreach(lc, grouping_target->exprs)
{
Expr *expr = (Expr *) lfirst(lc);
- Index sgref = get_pathtarget_sortgroupref(grouping_target, i);
+ Index sgref = get_pathtarget_sortgroupref(grouping_target, i++);
+
+ /*
+ * GroupingFunc does not need to be evaluated in Partial Aggregate,
+ * since Partial Aggregate will not handle multiple grouping sets at
+ * once.
+ */
+ if (IsA(expr, GroupingFunc))
+ continue;
if (sgref && parse->groupClause &&
get_sortgroupref_clause_noerr(sgref, parse->groupClause) != NULL)
@@ -5211,8 +5228,6 @@ make_partial_grouping_target(PlannerInfo *root,
*/
non_group_cols = lappend(non_group_cols, expr);
}
-
- i++;
}
/*
@@ -6406,7 +6421,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
consider_groupingsets_paths(root, grouped_rel,
path, true, can_hash,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else if (parse->hasAggs)
{
@@ -6473,7 +6488,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
-1.0);
}
- if (parse->hasAggs)
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, true, can_hash,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else if (parse->hasAggs)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6508,7 +6530,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else
{
@@ -6556,17 +6578,27 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
dNumGroups);
if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ {
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6706,13 +6738,15 @@ create_partial_grouping_paths(PlannerInfo *root,
get_number_of_groups(root,
cheapest_total_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ true);
if (cheapest_partial_path != NULL)
dNumPartialPartialGroups =
get_number_of_groups(root,
cheapest_partial_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ true);
if (can_sort && cheapest_total_path != NULL)
{
@@ -6734,11 +6768,28 @@ create_partial_grouping_paths(PlannerInfo *root,
{
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
+ {
+ List *pathkeys;
+
+ /*
+ * If we are performing Partial Aggregate for grouping
+ * sets, we need to sort by all the columns in
+ * parse->groupClause.
+ */
+ if (parse->groupingSets)
+ pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ parse->groupClause,
+ root->processed_tlist);
+ else
+ pathkeys = root->group_pathkeys;
+
path = (Path *) create_sort_path(root,
partially_grouped_rel,
path,
- root->group_pathkeys,
+ pathkeys,
-1.0);
+ }
if (parse->hasAggs)
add_path(partially_grouped_rel, (Path *)
@@ -6778,11 +6829,28 @@ create_partial_grouping_paths(PlannerInfo *root,
{
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
+ {
+ List *pathkeys;
+
+ /*
+ * If we are performing Partial Aggregate for grouping
+ * sets, we need to sort by all the columns in
+ * parse->groupClause.
+ */
+ if (parse->groupingSets)
+ pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ parse->groupClause,
+ root->processed_tlist);
+ else
+ pathkeys = root->group_pathkeys;
+
path = (Path *) create_sort_path(root,
partially_grouped_rel,
path,
- root->group_pathkeys,
+ pathkeys,
-1.0);
+ }
if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
@@ -6952,11 +7020,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d884d2b..b5d79d2 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3014,6 +3014,7 @@ create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups)
@@ -3059,6 +3060,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->path.pathkeys = NIL;
pathnode->aggstrategy = aggstrategy;
+ pathnode->aggsplit = aggsplit;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 4b7703d..739f279 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1693,6 +1693,7 @@ typedef struct GroupingSetsPath
Path path;
Path *subpath; /* path representing input source */
AggStrategy aggstrategy; /* basic strategy */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
} GroupingSetsPath;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e70d6a3..9d912fd 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,6 +217,7 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups);
diff --git a/src/test/regress/expected/parallelgroupingsets.out b/src/test/regress/expected/parallelgroupingsets.out
new file mode 100644
index 0000000..97b181e
--- /dev/null
+++ b/src/test/regress/expected/parallelgroupingsets.out
@@ -0,0 +1,172 @@
+--
+-- grouping sets
+--
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int) with (parallel_workers = 4);
+insert into gstest select 1,10,100 from generate_series(1,10)i;
+insert into gstest select 1,10,200 from generate_series(1,10)i;
+insert into gstest select 1,20,30 from generate_series(1,10)i;
+insert into gstest select 2,30,40 from generate_series(1,10)i;
+insert into gstest select 2,40,50 from generate_series(1,10)i;
+insert into gstest select 3,50,60 from generate_series(1,10)i;
+insert into gstest select 1,NULL,0 from generate_series(1,10)i;
+analyze gstest;
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+ QUERY PLAN
+------------------------------------------------------
+ Finalize HashAggregate
+ Output: c1, c2, avg(c3)
+ Hash Key: gstest.c1, gstest.c2
+ Hash Key: gstest.c1
+ -> Gather
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(12 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.00000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3));
+ QUERY PLAN
+----------------------------------------------------------
+ Finalize HashAggregate
+ Output: c1, c2, c3, avg(c3)
+ Hash Key: gstest.c1, gstest.c2
+ Hash Key: gstest.c1
+ Hash Key: gstest.c2, gstest.c3
+ -> Gather
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(13 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.00000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.00000000000000000000
+(16 rows)
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+ QUERY PLAN
+------------------------------------------------------------
+ Finalize GroupAggregate
+ Output: c1, c2, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ -> Gather Merge
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(15 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.00000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3));
+ QUERY PLAN
+---------------------------------------------------------------
+ Finalize GroupAggregate
+ Output: c1, c2, c3, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ Sort Key: gstest.c2, gstest.c3
+ Group Key: gstest.c2, gstest.c3
+ -> Gather Merge
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2, gstest.c3
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(17 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.00000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.00000000000000000000
+(16 rows)
+
+drop table gstest;
diff --git a/src/test/regress/sql/parallelgroupingsets.sql b/src/test/regress/sql/parallelgroupingsets.sql
new file mode 100644
index 0000000..5a84938
--- /dev/null
+++ b/src/test/regress/sql/parallelgroupingsets.sql
@@ -0,0 +1,43 @@
+--
+-- grouping sets
+--
+
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int) with (parallel_workers = 4);
+
+insert into gstest select 1,10,100 from generate_series(1,10)i;
+insert into gstest select 1,10,200 from generate_series(1,10)i;
+insert into gstest select 1,20,30 from generate_series(1,10)i;
+insert into gstest select 2,30,40 from generate_series(1,10)i;
+insert into gstest select 2,40,50 from generate_series(1,10)i;
+insert into gstest select 3,50,60 from generate_series(1,10)i;
+insert into gstest select 1,NULL,0 from generate_series(1,10)i;
+analyze gstest;
+
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3));
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3)) order by 1,2,3,4;
+
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3));
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1), (c2,c3)) order by 1,2,3,4;
+
+
+drop table gstest;
--
2.7.4
On Wed, Jun 12, 2019 at 10:58:44AM +0800, Richard Guo wrote:
Hi all,
Paul and I have been hacking recently to implement parallel grouping
sets, and here we have two implementations.

Implementation 1
================

Attached is the patch and also there is a github branch [1] for this
work.

Parallel aggregation has already been supported in PostgreSQL and it is
implemented by aggregating in two stages. First, each worker performs an
aggregation step, producing a partial result for each group of which
that process is aware. Second, the partial results are transferred to
the leader via the Gather node. Finally, the leader merges the partial
results and produces the final result for each group.

We are implementing parallel grouping sets in the same way. The only
difference is that in the final stage, the leader performs a grouping
sets aggregation, rather than a normal aggregation.

The plan looks like:
# explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
grouping sets((c1,c2), (c1), (c2,c3));
QUERY PLAN
---------------------------------------------------------
Finalize MixedAggregate
Output: c1, c2, avg(c3), c3
Hash Key: t2.c2, t2.c3
Group Key: t2.c1, t2.c2
Group Key: t2.c1
-> Gather Merge
Output: c1, c2, c3, (PARTIAL avg(c3))
Workers Planned: 2
-> Sort
Output: c1, c2, c3, (PARTIAL avg(c3))
Sort Key: t2.c1, t2.c2
-> Partial HashAggregate
Output: c1, c2, c3, PARTIAL avg(c3)
Group Key: t2.c1, t2.c2, t2.c3
-> Parallel Seq Scan on public.t2
Output: c1, c2, c3
(16 rows)

As the partial aggregation can be performed in parallel, we can expect a
speedup if the number of groups seen by the Finalize Aggregate node is
some less than the number of input rows.

For example, for the table provided in the test case within the patch,
running the above query in my Linux box:

# explain analyze select c1, c2, avg(c3) from t2 group by grouping
sets((c1,c2), (c1), (c2,c3)); -- without patch
Planning Time: 0.123 ms
Execution Time: 9459.362 ms

# explain analyze select c1, c2, avg(c3) from t2 group by grouping
sets((c1,c2), (c1), (c2,c3)); -- with patch
Planning Time: 0.204 ms
Execution Time: 1077.654 ms
Very nice. That's pretty much exactly how I imagined it'd work.
But sometimes we may not benefit from this patch. For example, in the
worst-case scenario the number of groups seen by the Finalize Aggregate
node could be as many as the number of input rows which were seen by all
worker processes in the Partial Aggregate stage. This is prone to
happening with this patch, because the group key for Partial Aggregate
is all the columns involved in the grouping sets, such as in the above
query, it is (c1, c2, c3).

So, we have been working on another way to implement parallel grouping
sets.

Implementation 2
================

This work can be found in github branch [2]. As it contains some hacky
codes and a list of TODO items, this is far from a patch. So please
consider it as a PoC.

The idea is instead of performing grouping sets aggregation in Finalize
Aggregate, we perform it in Partial Aggregate.

The plan looks like:
# explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
grouping sets((c1,c2), (c1));
QUERY PLAN
--------------------------------------------------------------
Finalize GroupAggregate
Output: c1, c2, avg(c3), (gset_id)
Group Key: t2.c1, t2.c2, (gset_id)
-> Gather Merge
Output: c1, c2, (gset_id), (PARTIAL avg(c3))
Workers Planned: 2
-> Sort
Output: c1, c2, (gset_id), (PARTIAL avg(c3))
Sort Key: t2.c1, t2.c2, (gset_id)
-> Partial HashAggregate
Output: c1, c2, gset_id, PARTIAL avg(c3)
Hash Key: t2.c1, t2.c2
Hash Key: t2.c1
-> Parallel Seq Scan on public.t2
Output: c1, c2, c3
(15 rows)
OK, I'm not sure I understand the point of this - can you give an
example which is supposed to benefit from this? Where does the speedup
come from?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 14 Jun 2019 at 11:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Wed, Jun 12, 2019 at 10:58:44AM +0800, Richard Guo wrote:
# explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
grouping sets((c1,c2), (c1));
QUERY PLAN
--------------------------------------------------------------
Finalize GroupAggregate
Output: c1, c2, avg(c3), (gset_id)
Group Key: t2.c1, t2.c2, (gset_id)
-> Gather Merge
Output: c1, c2, (gset_id), (PARTIAL avg(c3))
Workers Planned: 2
-> Sort
Output: c1, c2, (gset_id), (PARTIAL avg(c3))
Sort Key: t2.c1, t2.c2, (gset_id)
-> Partial HashAggregate
Output: c1, c2, gset_id, PARTIAL avg(c3)
Hash Key: t2.c1, t2.c2
Hash Key: t2.c1
-> Parallel Seq Scan on public.t2
Output: c1, c2, c3
(15 rows)

OK, I'm not sure I understand the point of this - can you give an
example which is supposed to benefit from this? Where does the speedup
come from?
I think this is a bad example since the first grouping set is a
superset of the 2nd. If those were independent and each grouping set
produced a reasonable number of groups then it may be better to do it
this way instead of grouping by all exprs in all grouping sets in the
first phase, as is done by #1. To do #2 would require that we tag
the aggregate state with the grouping set it belongs to, which seems
to be what gset_id is in Richard's output.
In my example upthread the first phase of aggregation produced a group
per input row. Method #2 would work better for that case since it
would only produce 2000 groups instead of 1 million.
Likely both methods would be good to consider, but since #1 seems much
easier than #2, then to me it seems to make sense to start there.
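David's point can be illustrated with a toy simulation (Python with synthetic data, not patch code): method #1 partially groups by the union of all grouping columns, while method #2 groups each set separately and tags the rows with a gset_id, which can produce far fewer partial groups when the sets are independent:

```python
# Rough sketch of why method #2 can win when the grouping sets are
# independent: method #1 partially groups by the union of all grouping
# columns; method #2 keeps one partial group per (gset_id, per-set key).
from itertools import product

rows = list(product(range(10), range(10), range(10)))  # columns a, b, c
sets = [(0, 1), (2,)]          # grouping sets (a,b) and (c) -- independent

# Method #1: one partial pass keyed on all columns in any grouping set.
union_cols = (0, 1, 2)
m1_groups = {tuple(r[i] for i in union_cols) for r in rows}

# Method #2: one partial group per (gset_id, per-set key).
m2_groups = {(gid, tuple(r[i] for i in s))
             for gid, s in enumerate(sets) for r in rows}

assert len(m1_groups) == 1000  # as many partial groups as distinct rows
assert len(m2_groups) == 110   # 100 groups for (a,b) plus 10 for (c)
```

With method #1 the Finalize stage here would see one input group per row, so the partial stage buys nothing; with method #2 it sees only 110 tagged groups.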
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 14, 2019 at 12:02:52PM +1200, David Rowley wrote:
On Fri, 14 Jun 2019 at 11:45, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On Wed, Jun 12, 2019 at 10:58:44AM +0800, Richard Guo wrote:
# explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
grouping sets((c1,c2), (c1));
QUERY PLAN
--------------------------------------------------------------
Finalize GroupAggregate
Output: c1, c2, avg(c3), (gset_id)
Group Key: t2.c1, t2.c2, (gset_id)
-> Gather Merge
Output: c1, c2, (gset_id), (PARTIAL avg(c3))
Workers Planned: 2
-> Sort
Output: c1, c2, (gset_id), (PARTIAL avg(c3))
Sort Key: t2.c1, t2.c2, (gset_id)
-> Partial HashAggregate
Output: c1, c2, gset_id, PARTIAL avg(c3)
Hash Key: t2.c1, t2.c2
Hash Key: t2.c1
-> Parallel Seq Scan on public.t2
Output: c1, c2, c3
(15 rows)

OK, I'm not sure I understand the point of this - can you give an
example which is supposed to benefit from this? Where does the speedup
come from?

I think this is a bad example since the first grouping set is a
superset of the 2nd. If those were independent and each grouping set
produced a reasonable number of groups then it may be better to do it
this way instead of grouping by all exprs in all grouping sets in the
first phase, as is done by #1. To do #2 would require that we tag
the aggregate state with the grouping set it belongs to, which seems
to be what gset_id is in Richard's output.
Aha! So if we have grouping sets (a,b) and (c,d), then with the first
approach we'd do partial aggregate on (a,b,c,d) - which may produce
quite a few distinct groups, making it inefficient. But with the second
approach, we'd do just (a,b) and (c,d) and mark the rows with gset_id.
Neat!
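The gset_id mechanism for that (a,b)/(c,d) case can be sketched as follows (a toy Python model with made-up data; the patch's actual data structures differ). Grouping partial states on (gset_id, keys) keeps states from different sets from mixing, so merging per-worker partials gives the same answer as a single pass:

```python
# Minimal sketch of gset_id tagging for grouping sets (a,b) and (c,d):
# workers emit partial count(*) states tagged with the set they belong
# to, and the leader merges on (gset_id, keys).
from collections import Counter

def worker(rows, sets):
    partial = Counter()
    for row in rows:
        for gid, cols in enumerate(sets):
            partial[(gid, tuple(row[c] for c in cols))] += 1
    return partial

sets = [(0, 1), (2, 3)]                 # grouping sets (a,b) and (c,d)
rows = [(1, 1, 7, 8), (1, 1, 7, 9), (1, 2, 7, 8)]
half = len(rows) // 2
leader = worker(rows[:half], sets) + worker(rows[half:], sets)  # merge

# Same answer as computing count(*) per grouping set over all rows at once.
assert leader == worker(rows, sets)
assert leader[(0, (1, 1))] == 2         # group (a=1, b=1) in set (a,b)
assert leader[(1, (7, 8))] == 2         # group (c=7, d=8) in set (c,d)
```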
In my example upthread the first phase of aggregation produced a group
per input row. Method #2 would work better for that case since it
would only produce 2000 groups instead of 1 million.

Likely both methods would be good to consider, but since #1 seems much
easier than #2, then to me it seems to make sense to start there.
Yep. Thanks for the explanation.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <riguo@pivotal.io> wrote:
Hi all,
Paul and I have been hacking recently to implement parallel grouping
sets, and here we have two implementations.

Implementation 1
================

Attached is the patch and also there is a github branch [1] for this
work.
Rebased with the latest master.
Thanks
Richard
Attachments:
v3-0001-Implementing-parallel-grouping-sets.patch
From ae0d372bc194119013a66c63e4ec371db23779be Mon Sep 17 00:00:00 2001
From: Richard Guo <riguo@pivotal.io>
Date: Tue, 11 Jun 2019 07:48:29 +0000
Subject: [PATCH] Implementing parallel grouping sets.
Parallel aggregation has already been supported in PostgreSQL and it is
implemented by aggregating in two stages. First, each worker performs an
aggregation step, producing a partial result for each group of which
that process is aware. Second, the partial results are transferred to
the leader via the Gather node. Finally, the leader merges the partial
results and produces the final result for each group.
We are implementing parallel grouping sets in the same way. The only
difference is that in the final stage, the leader performs a grouping
sets aggregation, rather than a normal aggregation.
---
src/backend/optimizer/plan/createplan.c | 4 +-
src/backend/optimizer/plan/planner.c | 129 ++++++++++++----
src/backend/optimizer/util/pathnode.c | 2 +
src/include/nodes/pathnodes.h | 1 +
src/include/optimizer/pathnode.h | 1 +
.../regress/expected/groupingsets_parallel.out | 172 +++++++++++++++++++++
src/test/regress/parallel_schedule | 1 +
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/groupingsets_parallel.sql | 43 ++++++
9 files changed, 319 insertions(+), 35 deletions(-)
create mode 100644 src/test/regress/expected/groupingsets_parallel.out
create mode 100644 src/test/regress/sql/groupingsets_parallel.sql
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index c6b8553..a6dd314 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2244,7 +2244,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
@@ -2282,7 +2282,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 36fefd9..b1adda5 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -148,7 +148,8 @@ static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
grouping_sets_data *gd,
- List *target_list);
+ List *target_list,
+ bool is_partial);
static RelOptInfo *create_grouping_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *target,
@@ -176,7 +177,8 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ AggSplit aggsplit);
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -3670,6 +3672,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
* path_rows: number of output rows from scan/join step
* gd: grouping sets data including list of grouping sets and their clauses
* target_list: target list containing group clause references
+ * is_partial: whether the grouping is in partial aggregate
*
* If doing grouping sets, we also annotate the gsets data with the estimates
* for each set and each individual rollup list, with a view to later
@@ -3679,7 +3682,8 @@ static double
get_number_of_groups(PlannerInfo *root,
double path_rows,
grouping_sets_data *gd,
- List *target_list)
+ List *target_list,
+ bool is_partial)
{
Query *parse = root->parse;
double dNumGroups;
@@ -3688,7 +3692,7 @@ get_number_of_groups(PlannerInfo *root,
{
List *groupExprs;
- if (parse->groupingSets)
+ if (parse->groupingSets && !is_partial)
{
/* Add up the estimates for each grouping set */
ListCell *lc;
@@ -3751,7 +3755,7 @@ get_number_of_groups(PlannerInfo *root,
}
else
{
- /* Plain GROUP BY */
+ /* Plain GROUP BY, or grouping is in partial aggregate */
groupExprs = get_sortgrouplist_exprs(parse->groupClause,
target_list);
@@ -4144,7 +4148,8 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
dNumGroups = get_number_of_groups(root,
cheapest_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ false);
/* Build final grouping paths */
add_paths_to_grouping_rel(root, input_rel, grouped_rel,
@@ -4189,7 +4194,8 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ AggSplit aggsplit)
{
Query *parse = root->parse;
@@ -4351,6 +4357,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
strat,
+ aggsplit,
new_rollups,
agg_costs,
dNumGroups));
@@ -4508,6 +4515,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_MIXED,
+ aggsplit,
rollups,
agg_costs,
dNumGroups));
@@ -4524,6 +4532,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_SORTED,
+ aggsplit,
gd->rollups,
agg_costs,
dNumGroups));
@@ -5198,7 +5207,15 @@ make_partial_grouping_target(PlannerInfo *root,
foreach(lc, grouping_target->exprs)
{
Expr *expr = (Expr *) lfirst(lc);
- Index sgref = get_pathtarget_sortgroupref(grouping_target, i);
+ Index sgref = get_pathtarget_sortgroupref(grouping_target, i++);
+
+ /*
+ * GroupingFunc does not need to be evaluated in Partial Aggregate,
+ * since Partial Aggregate will not handle multiple grouping sets at
+ * once.
+ */
+ if (IsA(expr, GroupingFunc))
+ continue;
if (sgref && parse->groupClause &&
get_sortgroupref_clause_noerr(sgref, parse->groupClause) != NULL)
@@ -5217,8 +5234,6 @@ make_partial_grouping_target(PlannerInfo *root,
*/
non_group_cols = lappend(non_group_cols, expr);
}
-
- i++;
}
/*
@@ -6412,7 +6427,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
consider_groupingsets_paths(root, grouped_rel,
path, true, can_hash,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else if (parse->hasAggs)
{
@@ -6479,7 +6494,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
-1.0);
}
- if (parse->hasAggs)
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, true, can_hash,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else if (parse->hasAggs)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6514,7 +6536,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else
{
@@ -6562,17 +6584,27 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
dNumGroups);
if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ {
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6712,13 +6744,15 @@ create_partial_grouping_paths(PlannerInfo *root,
get_number_of_groups(root,
cheapest_total_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ true);
if (cheapest_partial_path != NULL)
dNumPartialPartialGroups =
get_number_of_groups(root,
cheapest_partial_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ true);
if (can_sort && cheapest_total_path != NULL)
{
@@ -6740,11 +6774,28 @@ create_partial_grouping_paths(PlannerInfo *root,
{
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
+ {
+ List *pathkeys;
+
+ /*
+ * If we are performing Partial Aggregate for grouping
+ * sets, we need to sort by all the columns in
+ * parse->groupClause.
+ */
+ if (parse->groupingSets)
+ pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ parse->groupClause,
+ root->processed_tlist);
+ else
+ pathkeys = root->group_pathkeys;
+
path = (Path *) create_sort_path(root,
partially_grouped_rel,
path,
- root->group_pathkeys,
+ pathkeys,
-1.0);
+ }
if (parse->hasAggs)
add_path(partially_grouped_rel, (Path *)
@@ -6784,11 +6835,28 @@ create_partial_grouping_paths(PlannerInfo *root,
{
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
+ {
+ List *pathkeys;
+
+ /*
+ * If we are performing Partial Aggregate for grouping
+ * sets, we need to sort by all the columns in
+ * parse->groupClause.
+ */
+ if (parse->groupingSets)
+ pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ parse->groupClause,
+ root->processed_tlist);
+ else
+ pathkeys = root->group_pathkeys;
+
path = (Path *) create_sort_path(root,
partially_grouped_rel,
path,
- root->group_pathkeys,
+ pathkeys,
-1.0);
+ }
if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
@@ -6958,11 +7026,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0ac7398..6c1b5d9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2990,6 +2990,7 @@ create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups)
@@ -3035,6 +3036,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->path.pathkeys = NIL;
pathnode->aggstrategy = aggstrategy;
+ pathnode->aggsplit = aggsplit;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index e3c579e..6b89a12 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1698,6 +1698,7 @@ typedef struct GroupingSetsPath
Path path;
Path *subpath; /* path representing input source */
AggStrategy aggstrategy; /* basic strategy */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
} GroupingSetsPath;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 182ffee..6288da8 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -215,6 +215,7 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups);
diff --git a/src/test/regress/expected/groupingsets_parallel.out b/src/test/regress/expected/groupingsets_parallel.out
new file mode 100644
index 0000000..b0b143f
--- /dev/null
+++ b/src/test/regress/expected/groupingsets_parallel.out
@@ -0,0 +1,172 @@
+--
+-- parallel grouping sets
+--
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int) with (parallel_workers = 4);
+insert into gstest select 1,10,100 from generate_series(1,10)i;
+insert into gstest select 1,10,200 from generate_series(1,10)i;
+insert into gstest select 1,20,30 from generate_series(1,10)i;
+insert into gstest select 2,30,40 from generate_series(1,10)i;
+insert into gstest select 2,40,50 from generate_series(1,10)i;
+insert into gstest select 3,50,60 from generate_series(1,10)i;
+insert into gstest select 1,NULL,0 from generate_series(1,10)i;
+analyze gstest;
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+ QUERY PLAN
+------------------------------------------------------
+ Finalize HashAggregate
+ Output: c1, c2, avg(c3)
+ Hash Key: gstest.c1, gstest.c2
+ Hash Key: gstest.c1
+ -> Gather
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(12 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.00000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3));
+ QUERY PLAN
+----------------------------------------------------------
+ Finalize HashAggregate
+ Output: c1, c2, c3, avg(c3)
+ Hash Key: gstest.c1, gstest.c2
+ Hash Key: gstest.c1
+ Hash Key: gstest.c2, gstest.c3
+ -> Gather
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(13 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.00000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.00000000000000000000
+(16 rows)
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+ QUERY PLAN
+------------------------------------------------------------
+ Finalize GroupAggregate
+ Output: c1, c2, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ -> Gather Merge
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(15 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.00000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3));
+ QUERY PLAN
+---------------------------------------------------------------
+ Finalize GroupAggregate
+ Output: c1, c2, c3, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ Sort Key: gstest.c2, gstest.c3
+ Group Key: gstest.c2, gstest.c3
+ -> Gather Merge
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2, gstest.c3
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(17 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.00000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.00000000000000000000
+(16 rows)
+
+drop table gstest;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8fb55f0..8a88b83 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -87,6 +87,7 @@ test: rules psql psql_crosstab amutils stats_ext
# run by itself so it can run parallel workers
test: select_parallel
test: write_parallel
+test: groupingsets_parallel
# no relation related tests can be put in this group
test: publication subscription
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a39ca10..4495155 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -140,6 +140,7 @@ test: amutils
test: stats_ext
test: select_parallel
test: write_parallel
+test: groupingsets_parallel
test: publication
test: subscription
test: select_views
diff --git a/src/test/regress/sql/groupingsets_parallel.sql b/src/test/regress/sql/groupingsets_parallel.sql
new file mode 100644
index 0000000..fee2c9a
--- /dev/null
+++ b/src/test/regress/sql/groupingsets_parallel.sql
@@ -0,0 +1,43 @@
+--
+-- parallel grouping sets
+--
+
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int) with (parallel_workers = 4);
+
+insert into gstest select 1,10,100 from generate_series(1,10)i;
+insert into gstest select 1,10,200 from generate_series(1,10)i;
+insert into gstest select 1,20,30 from generate_series(1,10)i;
+insert into gstest select 2,30,40 from generate_series(1,10)i;
+insert into gstest select 2,40,50 from generate_series(1,10)i;
+insert into gstest select 3,50,60 from generate_series(1,10)i;
+insert into gstest select 1,NULL,0 from generate_series(1,10)i;
+analyze gstest;
+
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3));
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1));
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3));
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+
+
+drop table gstest;
--
2.7.4
On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <riguo@pivotal.io> wrote:
Hi all,
Paul and I have been hacking recently to implement parallel grouping
sets, and here we have two implementations.
Implementation 1
================
Attached is the patch and also there is a github branch [1] for this
work.
Rebased with the latest master.
Hi Richard,
thanks for the rebased patch. I think the patch is mostly fine (at least I
don't see any serious issues). A couple minor comments:
1) I think get_number_of_groups() would deserve a short explanation why
it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
same branch. Before, these cases were clearly separated; now it seems a bit
mixed up, and it may not be immediately obvious why it's OK.
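For illustration, here is a toy sketch of the two estimation behaviors the patch folds into get_number_of_groups() (invented names, not the planner code): a final grouping-sets aggregate adds up the per-set estimates, while a partial aggregate estimates over the union of all grouping columns, because the partial stage groups by all of them at once, just like a plain GROUP BY.

```python
# Toy sketch of the two estimation modes; the real logic lives in
# PostgreSQL's get_number_of_groups(), and these names are invented.

def distinct_groups(rows, cols):
    """Count distinct combinations of the given columns."""
    return len({tuple(r[c] for c in cols) for r in rows})

def estimate_groups(rows, grouping_sets, is_partial):
    if not is_partial:
        # Final grouping-sets aggregate: add up the estimate per set.
        return sum(distinct_groups(rows, s) for s in grouping_sets)
    # Partial aggregate: group by every column appearing in any set,
    # as if it were a plain GROUP BY over their union.
    all_cols = sorted({c for s in grouping_sets for c in s})
    return distinct_groups(rows, all_cols)

rows = [{"c1": 1, "c2": 10}, {"c1": 1, "c2": 20}, {"c1": 2, "c2": 10}]
print(estimate_groups(rows, [("c1", "c2"), ("c1",)], is_partial=False))  # 5
print(estimate_groups(rows, [("c1", "c2"), ("c1",)], is_partial=True))   # 3
```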
2) There are new regression tests, but they are not added to any schedule
(parallel or serial), and so are not executed as part of "make check". I
suppose this is a mistake.
3) The regression tests do check plan and results like this:
EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
SELECT ... ORDER BY 1, 2, 3;
which however means that the query might easily use a different plan than
what's verified in the explain (thanks to the additional ORDER BY clause).
So I think this should explain and execute the same query.
(In this case the plans seem to be the same, but that may easily change
in the future, and we could miss it here, failing to verify the results.)
4) It might be a good idea to check the negative case too, i.e. a query on a
data set that we should not parallelize (because the number of partial
groups would be too high).
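To make the negative case concrete, here is a back-of-envelope sketch (an invented cost model, not the planner's actual costing) of when two-stage parallel grouping sets can pay off: the Gather step transfers roughly one row per partial group per worker, so when the number of partial groups approaches the number of input rows, the partial stage saves nothing and the serial plan wins.

```python
# Invented toy cost model: compare a serial aggregate against a two-stage
# parallel one. Costs are in "rows processed" units; transfer_cost models
# the extra price of moving a partial result through Gather.

def worthwhile(input_rows, partial_groups, workers, transfer_cost=0.1):
    rows_per_worker = input_rows / workers
    # Serial: one process aggregates every input row.
    serial = input_rows
    # Parallel: each worker aggregates its share, then the leader
    # re-aggregates about partial_groups * workers transferred rows.
    parallel = rows_per_worker + partial_groups * workers * (1 + transfer_cost)
    return parallel < serial

print(worthwhile(1_000_000, 100, 4))       # few groups: parallel wins
print(worthwhile(1_000_000, 900_000, 4))   # nearly one group per row: loses
```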
Do you have any plans to hack on the second approach too? AFAICS those two
approaches are complementary (address different data sets / queries), and
it would be nice to have both. One of the things I've been wondering is if
we need to invent gset_id as a new concept, or if we could simply use the
existing GROUPING() function - that uniquely identifies the grouping set.
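The SQL-standard GROUPING() function indeed yields a distinct value per grouping set: it returns a bitmask whose bits are 1 for the argument columns that are rolled up (not grouped) in the current set. A minimal sketch of that semantics, using the grouping sets from the patch's test query:

```python
# Sketch of SQL's GROUPING(col1, ..., colN) semantics: the result is a
# bitmask whose bit for each listed column is 1 when that column is NOT
# part of the current grouping set, so any two sets that differ in a
# listed column get distinct values.

def grouping_value(args, grouping_set):
    value = 0
    for col in args:
        value = (value << 1) | (0 if col in grouping_set else 1)
    return value

args = ("c1", "c2", "c3")
# GROUPING SETS ((c1,c2), (c1), (c2,c3)), as in the test query:
for gset in [("c1", "c2"), ("c1",), ("c2", "c3")]:
    print(gset, grouping_value(args, gset))
# -> 1, 3, 4 respectively: each set gets a unique identifier.
```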
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <riguo@pivotal.io> wrote:
Hi all,
Paul and I have been hacking recently to implement parallel grouping
sets, and here we have two implementations.
Implementation 1
================
Attached is the patch and also there is a github branch [1] for this
work.
Rebased with the latest master.
Hi Richard,
thanks for the rebased patch. I think the patch is mostly fine (at least I
don't see any serious issues). A couple minor comments:
Hi Tomas,
Thank you for reviewing this patch.
1) I think get_number_of_groups() would deserve a short explanation why
it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
same branch. Before, these cases were clearly separated; now it seems a bit
mixed up, and it may not be immediately obvious why it's OK.
Added a short comment in get_number_of_groups() explaining the behavior
when doing partial aggregation for grouping sets.
2) There are new regression tests, but they are not added to any schedule
(parallel or serial), and so are not executed as part of "make check". I
suppose this is a mistake.
Yes, thanks. Added the new regression test in parallel_schedule and
serial_schedule.
3) The regression tests do check plan and results like this:
EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
SELECT ... ORDER BY 1, 2, 3;
which however means that the query might easily use a different plan than
what's verified in the explain (thanks to the additional ORDER BY clause).
So I think this should explain and execute the same query.
(In this case the plans seem to be the same, but that may easily change
in the future, and we could miss it here, failing to verify the results.)
Thank you for pointing this out. Fixed it in V4 patch.
4) It might be a good idea to check the negative case too, i.e. a query on a
data set that we should not parallelize (because the number of partial
groups would be too high).
Yes, agree. Added a negative case.
Do you have any plans to hack on the second approach too? AFAICS those two
approaches are complementary (address different data sets / queries), and
it would be nice to have both. One of the things I've been wondering is if
we need to invent gset_id as a new concept, or if we could simply use the
existing GROUPING() function - that uniquely identifies the grouping set.
Yes, I'm planning to hack on the second approach in the near future. I'm
also reconsidering the gset_id stuff since it brings a lot of complexity
for the second approach. I agree with you that we can try the GROUPING()
function to see if it can replace gset_id.
Thanks
Richard
Attachments:
v4-0001-Implementing-parallel-grouping-sets.patch
From 6d8a37cbb1252c6fc298d6848a5e061fbc13feb7 Mon Sep 17 00:00:00 2001
From: Richard Guo <riguo@pivotal.io>
Date: Tue, 11 Jun 2019 07:48:29 +0000
Subject: [PATCH] Implementing parallel grouping sets.
Parallel aggregation has already been supported in PostgreSQL and it is
implemented by aggregating in two stages. First, each worker performs an
aggregation step, producing a partial result for each group of which
that process is aware. Second, the partial results are transferred to
the leader via the Gather node. Finally, the leader merges the partial
results and produces the final result for each group.
We are implementing parallel grouping sets in the same way. The only
difference is that in the final stage, the leader performs a grouping
sets aggregation, rather than a normal aggregation.
---
src/backend/optimizer/plan/createplan.c | 4 +-
src/backend/optimizer/plan/planner.c | 137 ++++++++++----
src/backend/optimizer/util/pathnode.c | 2 +
src/include/nodes/pathnodes.h | 1 +
src/include/optimizer/pathnode.h | 1 +
.../regress/expected/groupingsets_parallel.out | 201 +++++++++++++++++++++
src/test/regress/parallel_schedule | 1 +
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/groupingsets_parallel.sql | 50 +++++
9 files changed, 363 insertions(+), 35 deletions(-)
create mode 100644 src/test/regress/expected/groupingsets_parallel.out
create mode 100644 src/test/regress/sql/groupingsets_parallel.sql
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index c6b8553..a6dd314 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2244,7 +2244,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
@@ -2282,7 +2282,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 36fefd9..cd8e276 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -148,7 +148,8 @@ static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
grouping_sets_data *gd,
- List *target_list);
+ List *target_list,
+ bool is_partial);
static RelOptInfo *create_grouping_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *target,
@@ -176,7 +177,8 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ AggSplit aggsplit);
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -3670,6 +3672,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
* path_rows: number of output rows from scan/join step
* gd: grouping sets data including list of grouping sets and their clauses
* target_list: target list containing group clause references
+ * is_partial: whether the grouping is in partial aggregate
*
* If doing grouping sets, we also annotate the gsets data with the estimates
* for each set and each individual rollup list, with a view to later
@@ -3679,7 +3682,8 @@ static double
get_number_of_groups(PlannerInfo *root,
double path_rows,
grouping_sets_data *gd,
- List *target_list)
+ List *target_list,
+ bool is_partial)
{
Query *parse = root->parse;
double dNumGroups;
@@ -3688,7 +3692,15 @@ get_number_of_groups(PlannerInfo *root,
{
List *groupExprs;
- if (parse->groupingSets)
+ /*
+ * Grouping sets
+ *
+ * If we are doing partial aggregation for grouping sets, we are
+ * supposed to estimate number of groups based on all the columns in
+ * parse->groupClause. Otherwise, we can add up the estimates for
+ * each grouping set.
+ */
+ if (parse->groupingSets && !is_partial)
{
/* Add up the estimates for each grouping set */
ListCell *lc;
@@ -3751,7 +3763,7 @@ get_number_of_groups(PlannerInfo *root,
}
else
{
- /* Plain GROUP BY */
+ /* Plain GROUP BY, or grouping is in partial aggregate */
groupExprs = get_sortgrouplist_exprs(parse->groupClause,
target_list);
@@ -4144,7 +4156,8 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
dNumGroups = get_number_of_groups(root,
cheapest_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ false);
/* Build final grouping paths */
add_paths_to_grouping_rel(root, input_rel, grouped_rel,
@@ -4189,7 +4202,8 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ AggSplit aggsplit)
{
Query *parse = root->parse;
@@ -4351,6 +4365,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
strat,
+ aggsplit,
new_rollups,
agg_costs,
dNumGroups));
@@ -4508,6 +4523,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_MIXED,
+ aggsplit,
rollups,
agg_costs,
dNumGroups));
@@ -4524,6 +4540,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_SORTED,
+ aggsplit,
gd->rollups,
agg_costs,
dNumGroups));
@@ -5198,7 +5215,15 @@ make_partial_grouping_target(PlannerInfo *root,
foreach(lc, grouping_target->exprs)
{
Expr *expr = (Expr *) lfirst(lc);
- Index sgref = get_pathtarget_sortgroupref(grouping_target, i);
+ Index sgref = get_pathtarget_sortgroupref(grouping_target, i++);
+
+ /*
+ * GroupingFunc does not need to be evaluated in Partial Aggregate,
+ * since Partial Aggregate will not handle multiple grouping sets at
+ * once.
+ */
+ if (IsA(expr, GroupingFunc))
+ continue;
if (sgref && parse->groupClause &&
get_sortgroupref_clause_noerr(sgref, parse->groupClause) != NULL)
@@ -5217,8 +5242,6 @@ make_partial_grouping_target(PlannerInfo *root,
*/
non_group_cols = lappend(non_group_cols, expr);
}
-
- i++;
}
/*
@@ -6412,7 +6435,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
consider_groupingsets_paths(root, grouped_rel,
path, true, can_hash,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else if (parse->hasAggs)
{
@@ -6479,7 +6502,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
-1.0);
}
- if (parse->hasAggs)
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, true, can_hash,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else if (parse->hasAggs)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6514,7 +6544,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else
{
@@ -6562,17 +6592,27 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
dNumGroups);
if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ {
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6712,13 +6752,15 @@ create_partial_grouping_paths(PlannerInfo *root,
get_number_of_groups(root,
cheapest_total_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ true);
if (cheapest_partial_path != NULL)
dNumPartialPartialGroups =
get_number_of_groups(root,
cheapest_partial_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ true);
if (can_sort && cheapest_total_path != NULL)
{
@@ -6740,11 +6782,28 @@ create_partial_grouping_paths(PlannerInfo *root,
{
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
+ {
+ List *pathkeys;
+
+ /*
+ * If we are performing Partial Aggregate for grouping
+ * sets, we need to sort by all the columns in
+ * parse->groupClause.
+ */
+ if (parse->groupingSets)
+ pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ parse->groupClause,
+ root->processed_tlist);
+ else
+ pathkeys = root->group_pathkeys;
+
path = (Path *) create_sort_path(root,
partially_grouped_rel,
path,
- root->group_pathkeys,
+ pathkeys,
-1.0);
+ }
if (parse->hasAggs)
add_path(partially_grouped_rel, (Path *)
@@ -6784,11 +6843,28 @@ create_partial_grouping_paths(PlannerInfo *root,
{
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
+ {
+ List *pathkeys;
+
+ /*
+ * If we are performing Partial Aggregate for grouping
+ * sets, we need to sort by all the columns in
+ * parse->groupClause.
+ */
+ if (parse->groupingSets)
+ pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ parse->groupClause,
+ root->processed_tlist);
+ else
+ pathkeys = root->group_pathkeys;
+
path = (Path *) create_sort_path(root,
partially_grouped_rel,
path,
- root->group_pathkeys,
+ pathkeys,
-1.0);
+ }
if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
@@ -6958,11 +7034,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0ac7398..6c1b5d9 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2990,6 +2990,7 @@ create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups)
@@ -3035,6 +3036,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->path.pathkeys = NIL;
pathnode->aggstrategy = aggstrategy;
+ pathnode->aggsplit = aggsplit;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index e3c579e..6b89a12 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1698,6 +1698,7 @@ typedef struct GroupingSetsPath
Path path;
Path *subpath; /* path representing input source */
AggStrategy aggstrategy; /* basic strategy */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
} GroupingSetsPath;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index 182ffee..6288da8 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -215,6 +215,7 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups);
diff --git a/src/test/regress/expected/groupingsets_parallel.out b/src/test/regress/expected/groupingsets_parallel.out
new file mode 100644
index 0000000..9151960
--- /dev/null
+++ b/src/test/regress/expected/groupingsets_parallel.out
@@ -0,0 +1,201 @@
+--
+-- parallel grouping sets
+--
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int) with (parallel_workers = 4);
+create table gstest1(c1 int, c2 int, c3 int);
+insert into gstest select 1,10,100 from generate_series(1,10)i;
+insert into gstest select 1,10,200 from generate_series(1,10)i;
+insert into gstest select 1,20,30 from generate_series(1,10)i;
+insert into gstest select 2,30,40 from generate_series(1,10)i;
+insert into gstest select 2,40,50 from generate_series(1,10)i;
+insert into gstest select 3,50,60 from generate_series(1,10)i;
+insert into gstest select 1,NULL,0 from generate_series(1,10)i;
+analyze gstest;
+insert into gstest1 select a,b,1 from generate_series(1,100) a, generate_series(1,100) b;
+analyze gstest1;
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+-- negative case
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest1 group by grouping sets((c1),(c2));
+ QUERY PLAN
+----------------------------------
+ HashAggregate
+ Output: c1, c2, avg(c3)
+ Hash Key: gstest1.c1
+ Hash Key: gstest1.c2
+ -> Seq Scan on public.gstest1
+ Output: c1, c2, c3
+(6 rows)
+
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ QUERY PLAN
+------------------------------------------------------------
+ Sort
+ Output: c1, c2, (avg(c3))
+ Sort Key: gstest.c1, gstest.c2, (avg(gstest.c3))
+ -> Finalize HashAggregate
+ Output: c1, c2, avg(c3)
+ Hash Key: gstest.c1, gstest.c2
+ Hash Key: gstest.c1
+ -> Gather
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(15 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.00000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ QUERY PLAN
+----------------------------------------------------------------
+ Sort
+ Output: c1, c2, c3, (avg(c3))
+ Sort Key: gstest.c1, gstest.c2, gstest.c3, (avg(gstest.c3))
+ -> Finalize HashAggregate
+ Output: c1, c2, c3, avg(c3)
+ Hash Key: gstest.c1, gstest.c2
+ Hash Key: gstest.c1
+ Hash Key: gstest.c2, gstest.c3
+ -> Gather
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(16 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.00000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.00000000000000000000
+(16 rows)
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ QUERY PLAN
+------------------------------------------------------------------
+ Sort
+ Output: c1, c2, (avg(c3))
+ Sort Key: gstest.c1, gstest.c2, (avg(gstest.c3))
+ -> Finalize GroupAggregate
+ Output: c1, c2, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ -> Gather Merge
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(18 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.00000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ QUERY PLAN
+---------------------------------------------------------------------
+ Sort
+ Output: c1, c2, c3, (avg(c3))
+ Sort Key: gstest.c1, gstest.c2, gstest.c3, (avg(gstest.c3))
+ -> Finalize GroupAggregate
+ Output: c1, c2, c3, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ Sort Key: gstest.c2, gstest.c3
+ Group Key: gstest.c2, gstest.c3
+ -> Gather Merge
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2, gstest.c3
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(20 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.00000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.00000000000000000000
+(16 rows)
+
+drop table gstest;
+drop table gstest1;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8fb55f0..8a88b83 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -87,6 +87,7 @@ test: rules psql psql_crosstab amutils stats_ext
# run by itself so it can run parallel workers
test: select_parallel
test: write_parallel
+test: groupingsets_parallel
# no relation related tests can be put in this group
test: publication subscription
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a39ca10..4495155 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -140,6 +140,7 @@ test: amutils
test: stats_ext
test: select_parallel
test: write_parallel
+test: groupingsets_parallel
test: publication
test: subscription
test: select_views
diff --git a/src/test/regress/sql/groupingsets_parallel.sql b/src/test/regress/sql/groupingsets_parallel.sql
new file mode 100644
index 0000000..fd71920
--- /dev/null
+++ b/src/test/regress/sql/groupingsets_parallel.sql
@@ -0,0 +1,50 @@
+--
+-- parallel grouping sets
+--
+
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int) with (parallel_workers = 4);
+create table gstest1(c1 int, c2 int, c3 int);
+
+insert into gstest select 1,10,100 from generate_series(1,10)i;
+insert into gstest select 1,10,200 from generate_series(1,10)i;
+insert into gstest select 1,20,30 from generate_series(1,10)i;
+insert into gstest select 2,30,40 from generate_series(1,10)i;
+insert into gstest select 2,40,50 from generate_series(1,10)i;
+insert into gstest select 3,50,60 from generate_series(1,10)i;
+insert into gstest select 1,NULL,0 from generate_series(1,10)i;
+analyze gstest;
+
+insert into gstest1 select a,b,1 from generate_series(1,100) a, generate_series(1,100) b;
+analyze gstest1;
+
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+
+-- negative case
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest1 group by grouping sets((c1),(c2));
+
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+
+drop table gstest;
+drop table gstest1;
--
2.7.4
Hi Richard & Tomas:

I followed the idea of the second approach: add a gset_id to the targetlist
of the first stage of grouping sets and use it to combine the aggregates in
the final stage. The gset_id stuff is kept because GROUPING() cannot
uniquely identify a grouping set; grouping sets may contain duplicated
sets, e.g. group by grouping sets((c1, c2), (c1, c2)).

There are some differences in implementing the second approach compared to
Richard's original idea: gset_id is not used as an additional group key in
the final stage; instead, we use it to dispatch each input tuple directly
to the specified grouping set and then do the aggregation. One advantage of
this is that we can handle multiple rollups with better performance and
without an Append node.

The plan now looks like:
gpadmin=# explain select c1, c2 from gstest group by grouping
sets(rollup(c1, c2), rollup(c3));
QUERY PLAN
--------------------------------------------------------------------------------------------
Finalize MixedAggregate (cost=1000.00..73108.57 rows=8842 width=12)
Dispatched by: (GROUPINGSETID())
Hash Key: c1, c2
Hash Key: c1
Hash Key: c3
Group Key: ()
Group Key: ()
-> Gather (cost=1000.00..71551.48 rows=17684 width=16)
Workers Planned: 2
-> Partial MixedAggregate (cost=0.00..68783.08 rows=8842 width=16)
Hash Key: c1, c2
Hash Key: c1
Hash Key: c3
Group Key: ()
Group Key: ()
-> Parallel Seq Scan on gstest (cost=0.00..47861.33 rows=2083333 width=12)
(16 rows)
gpadmin=# set enable_hashagg to off;
gpadmin=# explain select c1, c2 from gstest group by grouping
sets(rollup(c1, c2), rollup(c3));
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=657730.66..663207.45 rows=8842 width=12)
Dispatched by: (GROUPINGSETID())
Group Key: c1, c2
Sort Key: c1
Group Key: c1
Group Key: ()
Group Key: ()
Sort Key: c3
Group Key: c3
-> Sort (cost=657730.66..657774.87 rows=17684 width=16)
Sort Key: c1, c2
-> Gather (cost=338722.94..656483.04 rows=17684 width=16)
Workers Planned: 2
-> Partial GroupAggregate (cost=337722.94..653714.64 rows=8842 width=16)
Group Key: c1, c2
Group Key: c1
Group Key: ()
Group Key: ()
Sort Key: c3
Group Key: c3
-> Sort (cost=337722.94..342931.28 rows=2083333 width=12)
Sort Key: c1, c2
-> Parallel Seq Scan on gstest (cost=0.00..47861.33 rows=2083333 width=12)
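To make the dispatch-by-gset_id flow concrete, here is a toy Python sketch
(my own illustration, not the patch's C code; the rows and column indexes
are made up). Each "worker" runs the full grouping-sets aggregation on its
chunk and tags every partial state with a gset_id; the leader then
dispatches each partial state by that id and combines, so duplicated
grouping sets stay distinct even though GROUPING() could not tell them
apart:

```python
from collections import defaultdict

rows = [(1, 10, 100), (1, 10, 200), (1, 20, 30), (2, 30, 40)]
# column indexes per grouping set: (c1,c2), (c1), (c2,c3)
grouping_sets = [(0, 1), (0,), (1, 2)]

def partial_agg(chunk):
    """Worker side: emit (gset_id, key) -> [sum, count] partial states for avg(c3)."""
    states = defaultdict(lambda: [0, 0])
    for row in chunk:
        for gset_id, cols in enumerate(grouping_sets):
            key = tuple(row[c] for c in cols)
            s = states[(gset_id, key)]
            s[0] += row[2]   # accumulate c3
            s[1] += 1
    return states

def finalize(partials):
    """Leader side: dispatch each partial state by its gset_id and combine."""
    combined = defaultdict(lambda: [0, 0])
    for states in partials:
        for k, (s, n) in states.items():
            combined[k][0] += s
            combined[k][1] += n
    return {k: s / n for k, (s, n) in combined.items()}

# two "workers", each aggregating half of the rows
result = finalize([partial_agg(rows[:2]), partial_agg(rows[2:])])
```

Because the gset_id travels with every partial state, the leader never has
to guess which grouping set a tuple belongs to, which is what lets the
final stage route tuples to the right phase without an Append node.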
References:
[1]: https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3
On Wed, Jul 31, 2019 at 4:07 PM Richard Guo <riguo@pivotal.io> wrote:

On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:

On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <riguo@pivotal.io> wrote:

Hi all,

Paul and I have been hacking recently to implement parallel grouping
sets, and here we have two implementations.

Implementation 1
================

Attached is the patch and also there is a github branch [1] for this
work.

Rebased with the latest master.

Hi Richard,

thanks for the rebased patch. I think the patch is mostly fine (at least I
don't see any serious issues). A couple minor comments:

Hi Tomas,

Thank you for reviewing this patch.

1) I think get_number_of_groups() would deserve a short explanation why
it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
same branch. Before these cases were clearly separated, now it seems a bit
mixed up and it may not be immediately obvious why it's OK.

Added a short comment in get_number_of_groups() explaining the behavior
when doing partial aggregation for grouping sets.

2) There are new regression tests, but they are not added to any schedule
(parallel or serial), and so are not executed as part of "make check". I
suppose this is a mistake.

Yes, thanks. Added the new regression test in parallel_schedule and
serial_schedule.

3) The regression tests do check plan and results like this:

EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
SELECT ... ORDER BY 1, 2, 3;

which however means that the query might easily use a different plan than
what's verified in the explain (thanks to the additional ORDER BY clause).
So I think this should explain and execute the same query.

(In this case the plans seem to be the same, but that may easily change
in the future, and we could miss it here, failing to verify the results.)

Thank you for pointing this out. Fixed it in the V4 patch.

4) It might be a good idea to check the negative case too, i.e. a query on
a data set that we should not parallelize (because the number of partial
groups would be too high).

Yes, agree. Added a negative case.

Do you have any plans to hack on the second approach too? AFAICS those two
approaches are complementary (address different data sets / queries), and
it would be nice to have both. One of the things I've been wondering is if
we need to invent gset_id as a new concept, or if we could simply use the
existing GROUPING() function - that uniquely identifies the grouping set.

Yes, I'm planning to hack on the second approach in the near future. I'm
also reconsidering the gset_id stuff since it brings a lot of complexity
for the second approach. I agree with you that we can try the GROUPING()
function to see if it can replace gset_id.

Thanks
Richard
Attachments:
0001-Support-for-parallel-grouping-sets.patch
From 1d14a85d90857b5aaca8751bd92eadd0c8e1f2a9 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Tue, 24 Sep 2019 04:22:42 -0400
Subject: [PATCH] Support for parallel grouping sets
We used to support grouping sets in one worker only; this patch adds
support for parallel grouping sets across multiple workers.

In the first stage, the partial aggregates are performed by
multiple workers, each worker performing the aggregates on all
grouping sets. Meanwhile, a grouping set id is attached to
the tuples of the first stage to identify which grouping set each
tuple belongs to. In the final stage, the gathered tuples are
dispatched to the specified grouping set according to the
attached set id and the combine aggregates are then performed per
grouping set. We don't use the GROUPING() function to identify the
grouping set because the sets may contain duplicated grouping
sets.

Some changes are also made in the executor for the final stage:

For the AGG_HASHED strategy, all grouping sets still perform
combine aggregates in phase 0; the only difference is that
only one group is selected per tuple in the final stage, so we
need to skip the unselected groups.

For the AGG_MIXED strategy, phase 0 now also needs to do its
own aggregation.

For the AGG_SORTED strategy, rollups are expanded, e.g.
rollup(<c1, c2>, <c1>, <>) is expanded to three rollups:
rollup(<c1, c2>), rollup(<c1>) and rollup(<>), so tuples
can be dispatched to those three phases and aggregated
there.
---
src/backend/commands/explain.c | 10 +-
src/backend/executor/execExpr.c | 42 +++-
src/backend/executor/execExprInterp.c | 34 +++
src/backend/executor/nodeAgg.c | 319 ++++++++++++++++++++++++---
src/backend/nodes/copyfuncs.c | 55 ++++-
src/backend/nodes/equalfuncs.c | 3 +
src/backend/nodes/nodeFuncs.c | 8 +
src/backend/nodes/outfuncs.c | 13 +-
src/backend/nodes/readfuncs.c | 52 ++++-
src/backend/optimizer/path/allpaths.c | 3 +
src/backend/optimizer/plan/createplan.c | 16 +-
src/backend/optimizer/plan/planner.c | 376 +++++++++++++++++++++++---------
src/backend/optimizer/plan/setrefs.c | 16 ++
src/backend/optimizer/util/pathnode.c | 4 +-
src/backend/utils/adt/ruleutils.c | 6 +
src/include/executor/execExpr.h | 19 ++
src/include/executor/nodeAgg.h | 9 +-
src/include/nodes/execnodes.h | 14 ++
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 2 +
src/include/nodes/plannodes.h | 4 +-
src/include/nodes/primnodes.h | 6 +
src/include/optimizer/pathnode.h | 3 +-
src/include/optimizer/planmain.h | 2 +-
24 files changed, 857 insertions(+), 160 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb343..f1a2e21 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2206,12 +2206,16 @@ show_agg_keys(AggState *astate, List *ancestors,
{
Agg *plan = (Agg *) astate->ss.ps.plan;
- if (plan->numCols > 0 || plan->groupingSets)
+ if (plan->grpSetIdFilter)
+ show_expression(plan->grpSetIdFilter, "Dispatched by",
+ astate, ancestors, true, es);
+
+ if (plan->numCols > 0 || plan->rollup)
{
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(astate, ancestors);
- if (plan->groupingSets)
+ if (plan->rollup)
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
@@ -2263,7 +2267,7 @@ show_grouping_set_keys(PlanState *planstate,
Plan *plan = planstate->plan;
char *exprstr;
ListCell *lc;
- List *gsets = aggnode->groupingSets;
+ List *gsets = aggnode->rollup->gsets;
AttrNumber *keycols = aggnode->grpColIdx;
const char *keyname;
const char *keysetname;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 6d09f2a..27c8cd9 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -813,7 +813,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
agg = (Agg *) (state->parent->plan);
- if (agg->groupingSets)
+ if (agg->rollup)
scratch.d.grouping_func.clauses = grp_node->cols;
else
scratch.d.grouping_func.clauses = NIL;
@@ -822,6 +822,15 @@ ExecInitExprRec(Expr *node, ExprState *state,
break;
}
+ case T_GroupingSetId:
+ {
+ scratch.opcode = EEOP_GROUPING_SET_ID;
+ scratch.d.grouping_set_id.parent = (AggState *) state->parent;
+
+ ExprEvalPushStep(state, &scratch);
+ break;
+ }
+
case T_WindowFunc:
{
WindowFunc *wfunc = (WindowFunc *) node;
@@ -3214,6 +3223,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
+ int adjust_perhash_jumpnull = -1;
ExprContext *aggcontext;
if (ishash)
@@ -3246,6 +3256,30 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
adjust_init_jumpnull = state->steps_len - 1;
}
+ /*
+ * All grouping sets that use AGG_HASHED are sent to
+ * phase zero. When combining the partial aggregate
+ * results, only one group is selected per tuple,
+ * so we need to add one more check step to skip the
+ * unselected groups.
+ */
+ if (ishash && aggstate->grpsetid_filter &&
+ DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ scratch->opcode = EEOP_AGG_PERHASH_NULL_CHECK;
+ scratch->d.agg_perhash_null_check.aggstate = aggstate;
+ scratch->d.agg_perhash_null_check.setno = setno;
+ scratch->d.agg_perhash_null_check.setoff = setoff;
+ scratch->d.agg_perhash_null_check.transno = transno;
+ scratch->d.agg_perhash_null_check.jumpnull = -1; /* adjust later */
+ ExprEvalPushStep(state, scratch);
+
+ /*
+ * Note, we don't push into adjust_bailout here; this jump is adjusted separately below.
+ */
+ adjust_perhash_jumpnull = state->steps_len - 1;
+ }
+
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
@@ -3291,6 +3325,12 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
Assert(as->d.agg_init_trans.jumpnull == -1);
as->d.agg_init_trans.jumpnull = state->steps_len;
}
+ if (adjust_perhash_jumpnull != -1)
+ {
+ ExprEvalStep *as = &state->steps[adjust_perhash_jumpnull];
+ Assert(as->d.agg_perhash_null_check.jumpnull == -1);
+ as->d.agg_perhash_null_check.jumpnull = state->steps_len;
+ }
if (adjust_strict_jumpnull != -1)
{
ExprEvalStep *as = &state->steps[adjust_strict_jumpnull];
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 66a67c7..0895ad7 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -382,6 +382,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_XMLEXPR,
&&CASE_EEOP_AGGREF,
&&CASE_EEOP_GROUPING_FUNC,
+ &&CASE_EEOP_GROUPING_SET_ID,
&&CASE_EEOP_WINDOW_FUNC,
&&CASE_EEOP_SUBPLAN,
&&CASE_EEOP_ALTERNATIVE_SUBPLAN,
@@ -390,6 +391,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_PERHASH_NULL_CHECK,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
&&CASE_EEOP_AGG_PLAIN_TRANS,
@@ -1463,6 +1465,21 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_GROUPING_SET_ID)
+ {
+ int grpsetid;
+ AggState *aggstate = (AggState *) op->d.grouping_set_id.parent;
+
+ if (aggstate->current_phase == 0)
+ grpsetid = aggstate->perhash[aggstate->current_set].grpsetid;
+ else
+ grpsetid = aggstate->phase->grpsetids[aggstate->current_set];
+
+ *op->resvalue = grpsetid;
+ *op->resnull = false;
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_WINDOW_FUNC)
{
/*
@@ -1586,6 +1603,23 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PERHASH_NULL_CHECK)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+
+ aggstate = op->d.agg_perhash_null_check.aggstate;
+ pergroup = &aggstate->all_pergroups
+ [op->d.agg_perhash_null_check.setoff]
+ [op->d.agg_perhash_null_check.transno];
+
+ /* Skip this group if it was not selected for the current tuple. */
+ if (!pergroup)
+ EEO_JUMP(op->d.agg_perhash_null_check.jumpnull);
+
+ EEO_NEXT();
+ }
+
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a9a1fd0..ba9b3a3 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -226,6 +226,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
@@ -275,6 +276,7 @@ static void build_hash_table(AggState *aggstate);
static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
+static void agg_dispatch_input_tuples(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
@@ -313,9 +315,6 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
/*
* Switch to phase "newphase", which must either be 0 or 1 (to reset) or
* current_phase + 1. Juggle the tuplesorts accordingly.
- *
- * Phase 0 is for hashing, which we currently handle last in the AGG_MIXED
- * case, so when entering phase 0, all we need to do is drop open sorts.
*/
static void
initialize_phase(AggState *aggstate, int newphase)
@@ -332,6 +331,12 @@ initialize_phase(AggState *aggstate, int newphase)
aggstate->sort_in = NULL;
}
+ if (aggstate->store_in)
+ {
+ tuplestore_end(aggstate->store_in);
+ aggstate->store_in = NULL;
+ }
+
if (newphase <= 1)
{
/*
@@ -345,21 +350,36 @@ initialize_phase(AggState *aggstate, int newphase)
}
else
{
- /*
- * The old output tuplesort becomes the new input one, and this is the
- * right time to actually sort it.
+ /*
+ * When combining partial grouping sets aggregate results, we use
+ * the sort_in or store_in which contains the dispatched tuples as
+ * the input. Otherwise, use the the sort_out of previous phase.
*/
- aggstate->sort_in = aggstate->sort_out;
+ if (DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ aggstate->sort_in = aggstate->phases[newphase].sort_in;
+ aggstate->store_in = aggstate->phases[newphase].store_in;
+ }
+ else
+ {
+ aggstate->sort_in = aggstate->sort_out;
+ aggstate->store_in = NULL;
+ }
+
aggstate->sort_out = NULL;
- Assert(aggstate->sort_in);
- tuplesort_performsort(aggstate->sort_in);
+ Assert(aggstate->sort_in || aggstate->store_in);
+
+ /* This is the right time to actually sort it. */
+ if (aggstate->sort_in)
+ tuplesort_performsort(aggstate->sort_in);
}
/*
* If this isn't the last phase, we need to sort appropriately for the
* next phase in sequence.
*/
- if (newphase > 0 && newphase < aggstate->numphases - 1)
+ if (aggstate->aggsplit != AGGSPLIT_FINAL_DESERIAL &&
+ newphase > 0 && newphase < aggstate->numphases - 1)
{
Sort *sortnode = aggstate->phases[newphase + 1].sortnode;
PlanState *outerNode = outerPlanState(aggstate);
@@ -401,6 +421,15 @@ fetch_input_tuple(AggState *aggstate)
return NULL;
slot = aggstate->sort_slot;
}
+ else if (aggstate->store_in)
+ {
+ /* make sure we check for interrupts in either path through here */
+ CHECK_FOR_INTERRUPTS();
+ if (!tuplestore_gettupleslot(aggstate->store_in, true, false,
+ aggstate->sort_slot))
+ return NULL;
+ slot = aggstate->sort_slot;
+ }
else
slot = ExecProcNode(outerPlanState(aggstate));
@@ -1527,6 +1556,22 @@ lookup_hash_entries(AggState *aggstate)
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
+ if (aggstate->grpsetid_filter)
+ {
+ bool dummynull;
+ int grpsetid = ExecEvalExprSwitchContext(aggstate->grpsetid_filter,
+ aggstate->tmpcontext,
+ &dummynull);
+ GrpSetMapping *mapping = &aggstate->grpSetMappings[grpsetid];
+
+ if (!mapping)
+ return;
+
+ select_current_set(aggstate, mapping->index, true);
+ pergroup[mapping->index] = lookup_hash_entry(aggstate)->additional;
+ return;
+ }
+
for (setno = 0; setno < numHashes; setno++)
{
select_current_set(aggstate, setno, true);
@@ -1569,6 +1614,9 @@ ExecAgg(PlanState *pstate)
break;
case AGG_PLAIN:
case AGG_SORTED:
+ if (node->grpsetid_filter && !node->input_dispatched)
+ agg_dispatch_input_tuples(node);
+
result = agg_retrieve_direct(node);
break;
}
@@ -1680,10 +1728,20 @@ agg_retrieve_direct(AggState *aggstate)
else if (aggstate->aggstrategy == AGG_MIXED)
{
/*
- * Mixed mode; we've output all the grouped stuff and have
- * full hashtables, so switch to outputting those.
+ * Mixed mode; in the non-combine case, we've output all the
+ * grouped stuff and have full hashtables, so switch to
+ * outputting those. In the combine case, phase one has not
+ * done this, so we need to do our own hashing here.
*/
initialize_phase(aggstate, 0);
+
+ if (DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ /* use the store_in which contains the dispatched tuples */
+ aggstate->store_in = aggstate->phase->store_in;
+ agg_fill_hash_table(aggstate);
+ }
+
aggstate->table_filled = true;
ResetTupleHashIterator(aggstate->perhash[0].hashtable,
&aggstate->perhash[0].hashiter);
@@ -1838,7 +1896,8 @@ agg_retrieve_direct(AggState *aggstate)
* hashtables as well in advance_aggregates.
*/
if (aggstate->aggstrategy == AGG_MIXED &&
- aggstate->current_phase == 1)
+ aggstate->current_phase == 1 &&
+ !aggstate->grpsetid_filter)
{
lookup_hash_entries(aggstate);
}
@@ -1921,6 +1980,122 @@ agg_retrieve_direct(AggState *aggstate)
}
/*
+ * ExecAgg for parallel grouping sets:
+ *
+ * When combining the partial groupingsets aggregate results from workers,
+ * the input is mixed with tuples from different grouping sets. To avoid
+ * unnecessary work, the tuples are pre-dispatched directly to the
+ * corresponding phases.
+ *
+ * This function must be called in phase one, whose strategy is
+ * AGG_SORTED or AGG_PLAIN.
+ */
+static void
+agg_dispatch_input_tuples(AggState *aggstate)
+{
+ int grpsetid;
+ int phase;
+ bool isNull;
+ PlanState *saved_sort;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ GrpSetMapping *mapping;
+ TupleTableSlot *outerslot;
+ AggStatePerPhase perphase;
+
+ /* prepare tuplestore or tuplesort for each phase */
+ for (phase = 0; phase < aggstate->numphases; phase++)
+ {
+ perphase = &aggstate->phases[phase];
+
+ if (!perphase->aggnode)
+ continue;
+
+ if (perphase->aggstrategy == AGG_SORTED)
+ {
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+ Sort *sortnode = (Sort *) outerNode->plan;
+
+ Assert(perphase->aggstrategy == AGG_SORTED);
+
+ perphase->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ }
+ else
+ perphase->store_in = tuplestore_begin_heap(false, false, work_mem);
+ }
+
+ /*
+ * If phase one is AGG_SORTED, we cannot run the sort node beneath it
+ * directly, because its input mixes tuples from different grouping sets;
+ * we need to dispatch the tuples first and then do the sort.
+ *
+ * To do this, we replace the outerPlan of current AGG node with the child
+ * node of sort node.
+ *
+ * This is unnecessary for AGG_PLAIN.
+ */
+ if (aggstate->phase->aggstrategy == AGG_SORTED)
+ {
+ saved_sort = outerPlanState(aggstate);
+ outerPlanState(aggstate) = outerPlanState(outerPlanState(aggstate));
+ }
+
+ for (;;)
+ {
+ outerslot = fetch_input_tuple(aggstate);
+ if (TupIsNull(outerslot))
+ break;
+
+ /* set up for advance_aggregates */
+ tmpcontext->ecxt_outertuple = outerslot;
+ grpsetid = ExecEvalExprSwitchContext(aggstate->grpsetid_filter,
+ tmpcontext,
+ &isNull);
+
+ /* route the slot to the corresponding phase using its grouping set id */
+ mapping = &aggstate->grpSetMappings[grpsetid];
+ if (!mapping->is_hashed)
+ {
+ perphase = &aggstate->phases[mapping->index];
+
+ if (perphase->aggstrategy == AGG_SORTED)
+ tuplesort_puttupleslot(perphase->sort_in, outerslot);
+ else
+ tuplestore_puttupleslot(perphase->store_in, outerslot);
+ }
+ else
+ tuplestore_puttupleslot(aggstate->phases[0].store_in, outerslot);
+
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ /* Restore the outer plan and perform the sorting here. */
+ if (aggstate->phase->aggstrategy == AGG_SORTED)
+ {
+ outerPlanState(aggstate) = saved_sort;
+ tuplesort_performsort(aggstate->phase->sort_in);
+ }
+
+ /*
+ * Reinitialize phase one to use the store_in
+ * or sort_in that contains the dispatched tuples.
+ */
+ aggstate->sort_in = aggstate->phase->sort_in;
+ aggstate->store_in = aggstate->phase->store_in;
+ select_current_set(aggstate, 0, false);
+
+ /* mark the input as dispatched */
+ aggstate->input_dispatched = true;
+}
+
+/*
* ExecAgg for hashed case: read input and build hash table
*/
static void
@@ -2146,6 +2321,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->grp_firstTuple = NULL;
aggstate->sort_in = NULL;
aggstate->sort_out = NULL;
+ aggstate->input_dispatched = false;
/*
* phases[0] always exists, but is dummy in sorted/plain mode
@@ -2158,16 +2334,16 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* determines the size of some allocations. Also calculate the number of
* phases, since all hashed/mixed nodes contribute to only a single phase.
*/
- if (node->groupingSets)
+ if (node->rollup)
{
- numGroupingSets = list_length(node->groupingSets);
+ numGroupingSets = list_length(node->rollup->gsets);
foreach(l, node->chain)
{
Agg *agg = lfirst(l);
numGroupingSets = Max(numGroupingSets,
- list_length(agg->groupingSets));
+ list_length(agg->rollup->gsets));
/*
* additional AGG_HASHED aggs become part of phase 0, but all
@@ -2186,6 +2362,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
+ /*
+ * When combining the partial groupingsets aggregate results, we
+ * need a grpsetid mapping to find the corresponding perhash or
+ * perphase data.
+ */
+ if (DO_AGGSPLIT_COMBINE(node->aggsplit) && node->rollup)
+ aggstate->grpSetMappings = (GrpSetMapping *)
+ palloc0(sizeof(GrpSetMapping) * (numPhases + numHashes));
+
/*
* Create expression contexts. We need three or more, one for
* per-input-tuple processing, one for per-output-tuple processing, one
@@ -2243,8 +2428,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
/*
* If there are more than two phases (including a potential dummy phase
* 0), input will be resorted using tuplesort. Need a slot for that.
+ *
+ * Or, if we are combining partial groupingsets aggregate results, input
+ * belonging to an AGG_HASHED rollup will use a tuplestore. Need a slot for that.
*/
- if (numPhases > 2)
+ if (numPhases > 2 ||
+ (DO_AGGSPLIT_COMBINE(node->aggsplit) &&
+ node->aggstrategy == AGG_MIXED))
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -2291,6 +2481,14 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
ExecInitQual(node->plan.qual, (PlanState *) aggstate);
/*
+ * Initialize the grouping set id expression to identify which
+ * grouping set an input tuple belongs to when combining
+ * partial groupingsets aggregate results.
+ */
+ aggstate->grpsetid_filter = ExecInitExpr((Expr *) node->grpSetIdFilter,
+ (PlanState *)aggstate);
+
+ /*
* We should now have found all Aggrefs in the targetlist and quals.
*/
numaggs = aggstate->numaggs;
@@ -2348,6 +2546,21 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
/* but the actual Agg node representing this hash is saved here */
perhash->aggnode = aggnode;
+ if (aggnode->rollup)
+ {
+ GroupingSetData *gs =
+ linitial_node(GroupingSetData, aggnode->rollup->gsets_data);
+
+ perhash->grpsetid = gs->grpsetId;
+
+ /* add a mapping when combining */
+ if (DO_AGGSPLIT_COMBINE(aggnode->aggsplit))
+ {
+ aggstate->grpSetMappings[perhash->grpsetid].is_hashed = true;
+ aggstate->grpSetMappings[perhash->grpsetid].index = i;
+ }
+ }
+
phasedata->gset_lengths[i] = perhash->numCols = aggnode->numCols;
for (j = 0; j < aggnode->numCols; ++j)
@@ -2363,18 +2576,21 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
AggStatePerPhase phasedata = &aggstate->phases[++phase];
int num_sets;
- phasedata->numsets = num_sets = list_length(aggnode->groupingSets);
+ phasedata->numsets = num_sets = aggnode->rollup ?
+ list_length(aggnode->rollup->gsets) : 0;
if (num_sets)
{
phasedata->gset_lengths = palloc(num_sets * sizeof(int));
phasedata->grouped_cols = palloc(num_sets * sizeof(Bitmapset *));
+ phasedata->grpsetids = palloc(num_sets * sizeof(int));
i = 0;
- foreach(l, aggnode->groupingSets)
+ foreach(l, aggnode->rollup->gsets_data)
{
- int current_length = list_length(lfirst(l));
Bitmapset *cols = NULL;
+ GroupingSetData *gs = lfirst_node(GroupingSetData, l);
+ int current_length = list_length(gs->set);
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -2382,12 +2598,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->grouped_cols[i] = cols;
phasedata->gset_lengths[i] = current_length;
-
+ phasedata->grpsetids[i] = gs->grpsetId;
++i;
}
all_grouped_cols = bms_add_members(all_grouped_cols,
phasedata->grouped_cols[0]);
+
+ /* add a mapping when combining */
+ if (DO_AGGSPLIT_COMBINE(node->aggsplit))
+ {
+ aggstate->grpSetMappings[phasedata->grpsetids[0]].is_hashed = false;
+ aggstate->grpSetMappings[phasedata->grpsetids[0]].index = phase;
+ }
}
else
{
@@ -2871,23 +3094,50 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (!phase->aggnode)
continue;
- if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 1)
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ phaseidx == 1)
{
- /*
- * Phase one, and only phase one, in a mixed agg performs both
- * sorting and aggregation.
- */
- dohash = true;
- dosort = true;
+ if (!DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ /*
+ * Phase one, and only phase one, in a mixed agg performs both
+ * sorting and aggregation.
+ */
+ dohash = true;
+ dosort = true;
+ }
+ else
+ {
+ /*
+ * When combining partial groupingsets aggregate results, the input
+ * is dispatched according to the grouping set id; we cannot
+ * perform both sorted and hashed aggregation in one phase, so
+ * just perform the sorted aggregation.
+ */
+ dohash = false;
+ dosort = true;
+ }
}
else if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 0)
{
- /*
- * No need to compute a transition function for an AGG_MIXED phase
- * 0 - the contents of the hashtables will have been computed
- * during phase 1.
- */
- continue;
+ if (!DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ /*
+ * No need to compute a transition function for an AGG_MIXED phase
+ * 0 - the contents of the hashtables will have been computed
+ * during phase 1.
+ */
+ continue;
+ }
+ else
+ {
+ /*
+ * When combining partial groupingsets aggregate results, phase
+ * 0 needs to do its own hashed aggregation.
+ */
+ dohash = true;
+ dosort = false;
+ }
}
else if (phase->aggstrategy == AGG_PLAIN ||
phase->aggstrategy == AGG_SORTED)
@@ -3440,6 +3690,7 @@ ExecReScanAgg(AggState *node)
int setno;
node->agg_done = false;
+ node->input_dispatched = false;
if (node->aggstrategy == AGG_HASHED)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a2617c7..d3ec4b5 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -986,7 +986,7 @@ _copyAgg(const Agg *from)
}
COPY_SCALAR_FIELD(numGroups);
COPY_BITMAPSET_FIELD(aggParams);
- COPY_NODE_FIELD(groupingSets);
+ COPY_NODE_FIELD(rollup);
COPY_NODE_FIELD(chain);
return newnode;
@@ -1474,6 +1474,50 @@ _copyGroupingFunc(const GroupingFunc *from)
}
/*
+ * _copyGroupingSetId
+ */
+static GroupingSetId *
+_copyGroupingSetId(const GroupingSetId *from)
+{
+ GroupingSetId *newnode = makeNode(GroupingSetId);
+
+ return newnode;
+}
+
+/*
+ * _copyRollupData
+ */
+static RollupData *
+_copyRollupData(const RollupData *from)
+{
+ RollupData *newnode = makeNode(RollupData);
+
+ COPY_NODE_FIELD(groupClause);
+ COPY_NODE_FIELD(gsets);
+ COPY_NODE_FIELD(gsets_data);
+ COPY_SCALAR_FIELD(numGroups);
+ COPY_SCALAR_FIELD(hashable);
+ COPY_SCALAR_FIELD(is_hashed);
+
+ return newnode;
+}
+
+/*
+ * _copyGroupingSetData
+ */
+static GroupingSetData *
+_copyGroupingSetData(const GroupingSetData *from)
+{
+ GroupingSetData *newnode = makeNode(GroupingSetData);
+
+ COPY_NODE_FIELD(set);
+ COPY_SCALAR_FIELD(grpsetId);
+ COPY_SCALAR_FIELD(numGroups);
+
+ return newnode;
+}
+
+/*
* _copyWindowFunc
*/
static WindowFunc *
@@ -4938,6 +4982,9 @@ copyObjectImpl(const void *from)
case T_GroupingFunc:
retval = _copyGroupingFunc(from);
break;
+ case T_GroupingSetId:
+ retval = _copyGroupingSetId(from);
+ break;
case T_WindowFunc:
retval = _copyWindowFunc(from);
break;
@@ -5568,6 +5615,12 @@ copyObjectImpl(const void *from)
case T_SortGroupClause:
retval = _copySortGroupClause(from);
break;
+ case T_RollupData:
+ retval = _copyRollupData(from);
+ break;
+ case T_GroupingSetData:
+ retval = _copyGroupingSetData(from);
+ break;
case T_GroupingSet:
retval = _copyGroupingSet(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 4f2ebe5..dec6d4f 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3049,6 +3049,9 @@ equal(const void *a, const void *b)
case T_GroupingFunc:
retval = _equalGroupingFunc(a, b);
break;
+ case T_GroupingSetId:
+ retval = true;
+ break;
case T_WindowFunc:
retval = _equalWindowFunc(a, b);
break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index 18bd5ac..8dc702f 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -63,6 +63,9 @@ exprType(const Node *expr)
case T_GroupingFunc:
type = INT4OID;
break;
+ case T_GroupingSetId:
+ type = INT4OID;
+ break;
case T_WindowFunc:
type = ((const WindowFunc *) expr)->wintype;
break;
@@ -741,6 +744,9 @@ exprCollation(const Node *expr)
case T_GroupingFunc:
coll = InvalidOid;
break;
+ case T_GroupingSetId:
+ coll = InvalidOid;
+ break;
case T_WindowFunc:
coll = ((const WindowFunc *) expr)->wincollid;
break;
@@ -1870,6 +1876,7 @@ expression_tree_walker(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
/* primitive node types with no expression subnodes */
break;
case T_WithCheckOption:
@@ -2506,6 +2513,7 @@ expression_tree_mutator(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
return (Node *) copyObject(node);
case T_WithCheckOption:
{
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2..b3ff513 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -781,7 +781,7 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_OID_ARRAY(grpCollations, node->numCols);
WRITE_LONG_FIELD(numGroups);
WRITE_BITMAPSET_FIELD(aggParams);
- WRITE_NODE_FIELD(groupingSets);
+ WRITE_NODE_FIELD(rollup);
WRITE_NODE_FIELD(chain);
}
@@ -1146,6 +1146,13 @@ _outGroupingFunc(StringInfo str, const GroupingFunc *node)
}
static void
+_outGroupingSetId(StringInfo str,
+ const GroupingSetId *node pg_attribute_unused())
+{
+ WRITE_NODE_TYPE("GROUPINGSETID");
+}
+
+static void
_outWindowFunc(StringInfo str, const WindowFunc *node)
{
WRITE_NODE_TYPE("WINDOWFUNC");
@@ -1996,6 +2003,7 @@ _outGroupingSetData(StringInfo str, const GroupingSetData *node)
WRITE_NODE_TYPE("GSDATA");
WRITE_NODE_FIELD(set);
+ WRITE_INT_FIELD(grpsetId);
WRITE_FLOAT_FIELD(numGroups, "%.0f");
}
@@ -3824,6 +3832,9 @@ outNode(StringInfo str, const void *obj)
case T_GroupingFunc:
_outGroupingFunc(str, obj);
break;
+ case T_GroupingSetId:
+ _outGroupingSetId(str, obj);
+ break;
case T_WindowFunc:
_outWindowFunc(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb..4f76957 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -637,6 +637,50 @@ _readGroupingFunc(void)
}
/*
+ * _readGroupingSetId
+ */
+static GroupingSetId *
+_readGroupingSetId(void)
+{
+ READ_LOCALS_NO_FIELDS(GroupingSetId);
+
+ READ_DONE();
+}
+
+/*
+ * _readRollupData
+ */
+static RollupData *
+_readRollupData(void)
+{
+ READ_LOCALS(RollupData);
+
+ READ_NODE_FIELD(groupClause);
+ READ_NODE_FIELD(gsets);
+ READ_NODE_FIELD(gsets_data);
+ READ_FLOAT_FIELD(numGroups);
+ READ_BOOL_FIELD(hashable);
+ READ_BOOL_FIELD(is_hashed);
+
+ READ_DONE();
+}
+
+/*
+ * _readGroupingSetData
+ */
+static GroupingSetData *
+_readGroupingSetData(void)
+{
+ READ_LOCALS(GroupingSetData);
+
+ READ_NODE_FIELD(set);
+ READ_INT_FIELD(grpsetId);
+ READ_FLOAT_FIELD(numGroups);
+
+ READ_DONE();
+}
+
+/*
* _readWindowFunc
*/
static WindowFunc *
@@ -2171,7 +2215,7 @@ _readAgg(void)
READ_OID_ARRAY(grpCollations, local_node->numCols);
READ_LONG_FIELD(numGroups);
READ_BITMAPSET_FIELD(aggParams);
- READ_NODE_FIELD(groupingSets);
+ READ_NODE_FIELD(rollup);
READ_NODE_FIELD(chain);
READ_DONE();
@@ -2607,6 +2651,12 @@ parseNodeString(void)
return_value = _readAggref();
else if (MATCH("GROUPINGFUNC", 12))
return_value = _readGroupingFunc();
+ else if (MATCH("GROUPINGSETID", 13))
+ return_value = _readGroupingSetId();
+ else if (MATCH("ROLLUP", 6))
+ return_value = _readRollupData();
+ else if (MATCH("GSDATA", 6))
+ return_value = _readGroupingSetData();
else if (MATCH("WINDOWFUNC", 10))
return_value = _readWindowFunc();
else if (MATCH("SUBSCRIPTINGREF", 15))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a..a357f37 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2708,6 +2708,9 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
NULL, rowsp);
add_path(rel, simple_gather_path);
+ if (root->parse->groupingSets)
+ return;
+
/*
* For each useful ordering, we can consider an order-preserving Gather
* Merge.
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0c03620..6fb1a98 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1639,7 +1639,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
groupColIdx,
groupOperators,
groupCollations,
- NIL,
+ NULL,
NIL,
best_path->path.rows,
subplan);
@@ -2091,7 +2091,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
extract_grouping_ops(best_path->groupClause),
extract_grouping_collations(best_path->groupClause,
subplan->targetlist),
- NIL,
+ NULL,
NIL,
best_path->numGroups,
subplan);
@@ -2247,12 +2247,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
NIL,
rollup->numGroups,
sort_plan);
@@ -2285,12 +2285,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
chain,
rollup->numGroups,
subplan);
@@ -6189,7 +6189,7 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
+ RollupData *rollup, List *chain,
double dNumGroups, Plan *lefttree)
{
Agg *node = makeNode(Agg);
@@ -6207,7 +6207,7 @@ make_agg(List *tlist, List *qual,
node->grpCollations = grpCollations;
node->numGroups = numGroups;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
- node->groupingSets = groupingSets;
+ node->rollup = rollup;
node->chain = chain;
plan->qual = qual;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f08..f147cac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -107,6 +107,7 @@ typedef struct
typedef struct
{
List *rollups;
+ List *final_rollups;
List *hash_sets_idx;
double dNumHashGroups;
bool any_hashable;
@@ -114,6 +115,7 @@ typedef struct
Bitmapset *unhashable_refs;
List *unsortable_sets;
int *tleref_to_colnum_map;
+ int numGroupingSets;
} grouping_sets_data;
/*
@@ -127,6 +129,8 @@ typedef struct
* clauses per Window */
} WindowClauseSortData;
+typedef void (*add_path_callback) (RelOptInfo *parent_rel, Path *new_path);
+
/* Local functions */
static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
@@ -143,7 +147,8 @@ static double preprocess_limit(PlannerInfo *root,
static void remove_useless_groupby_columns(PlannerInfo *root);
static List *preprocess_groupclause(PlannerInfo *root, List *force);
static List *extract_rollup_sets(List *groupingSets);
-static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
+static List *reorder_grouping_sets(grouping_sets_data *gd,
+ List *groupingSets, List *sortclause);
static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
@@ -176,7 +181,10 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ List *havingQual,
+ AggSplit aggsplit);
+
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -2437,6 +2445,8 @@ preprocess_grouping_sets(PlannerInfo *root)
int maxref = 0;
ListCell *lc;
ListCell *lc_set;
+ ListCell *lc_rollup;
+ RollupData *rollup;
grouping_sets_data *gd = palloc0(sizeof(grouping_sets_data));
parse->groupingSets = expand_grouping_sets(parse->groupingSets, -1);
@@ -2488,6 +2498,7 @@ preprocess_grouping_sets(PlannerInfo *root)
GroupingSetData *gs = makeNode(GroupingSetData);
gs->set = gset;
+ gs->grpsetId = gd->numGroupingSets++;
gd->unsortable_sets = lappend(gd->unsortable_sets, gs);
/*
@@ -2519,8 +2530,8 @@ preprocess_grouping_sets(PlannerInfo *root)
foreach(lc_set, sets)
{
List *current_sets = (List *) lfirst(lc_set);
- RollupData *rollup = makeNode(RollupData);
GroupingSetData *gs;
+ rollup = makeNode(RollupData);
/*
* Reorder the current list of grouping sets into correct prefix
@@ -2532,7 +2543,7 @@ preprocess_grouping_sets(PlannerInfo *root)
* largest-member-first, and applies the GroupingSetData annotations,
* though the data will be filled in later.
*/
- current_sets = reorder_grouping_sets(current_sets,
+ current_sets = reorder_grouping_sets(gd, current_sets,
(list_length(sets) == 1
? parse->sortClause
: NIL));
@@ -2584,6 +2595,33 @@ preprocess_grouping_sets(PlannerInfo *root)
gd->rollups = lappend(gd->rollups, rollup);
}
+ /* divide the rollups into final_rollups, one grouping set per rollup */
+ foreach(lc_rollup, gd->rollups)
+ {
+ RollupData *initial_rollup = lfirst(lc_rollup);
+
+ foreach(lc, initial_rollup->gsets_data)
+ {
+ GroupingSetData *gs = lfirst(lc);
+ rollup = makeNode(RollupData);
+
+ if (gs->set == NIL)
+ rollup->groupClause = NIL;
+ else
+ rollup->groupClause = preprocess_groupclause(root, gs->set);
+ rollup->gsets_data = list_make1(gs);
+ rollup->gsets = remap_to_groupclause_idx(rollup->groupClause,
+ rollup->gsets_data,
+ gd->tleref_to_colnum_map);
+
+ rollup->numGroups = gs->numGroups;
+ rollup->hashable = initial_rollup->hashable;
+ rollup->is_hashed = initial_rollup->is_hashed;
+
+ gd->final_rollups = lappend(gd->final_rollups, rollup);
+ }
+ }
+
if (gd->unsortable_sets)
{
/*
@@ -3541,7 +3579,7 @@ extract_rollup_sets(List *groupingSets)
* gets implemented in one pass.)
*/
static List *
-reorder_grouping_sets(List *groupingsets, List *sortclause)
+reorder_grouping_sets(grouping_sets_data *gd, List *groupingsets, List *sortclause)
{
ListCell *lc;
List *previous = NIL;
@@ -3575,6 +3613,7 @@ reorder_grouping_sets(List *groupingsets, List *sortclause)
previous = list_concat(previous, new_elems);
gs->set = list_copy(previous);
+ gs->grpsetId = gd->numGroupingSets++;
result = lcons(gs, result);
}
@@ -3725,6 +3764,30 @@ get_number_of_groups(PlannerInfo *root,
dNumGroups += rollup->numGroups;
}
+ foreach(lc, gd->final_rollups)
+ {
+ RollupData *rollup = lfirst_node(RollupData, lc);
+ ListCell *lc;
+
+ groupExprs = get_sortgrouplist_exprs(rollup->groupClause,
+ target_list);
+
+ rollup->numGroups = 0.0;
+
+ forboth(lc, rollup->gsets, lc2, rollup->gsets_data)
+ {
+ List *gset = (List *) lfirst(lc);
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc2);
+ double numGroups = estimate_num_groups(root,
+ groupExprs,
+ path_rows,
+ &gset);
+
+ gs->numGroups = numGroups;
+ rollup->numGroups += numGroups;
+ }
+ }
+
if (gd->hash_sets_idx)
{
ListCell *lc;
@@ -4190,9 +4253,26 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ List *havingQual,
+ AggSplit aggsplit)
{
- Query *parse = root->parse;
+ /* For partial path, add it to partial_pathlist */
+ add_path_callback add_path_cb =
+ (aggsplit == AGGSPLIT_INITIAL_SERIAL) ? add_partial_path : add_path;
+
+ /*
+ * If we are combining the partial groupingsets aggregation, the input is
+ * a mix of tuples from different grouping sets; the executor dispatches
+ * the tuples to different rollups (phases) according to the grouping set id.
+ *
+ * We cannot use the same rollups as the initial stage, in which each tuple
+ * is processed by one or more grouping sets in one rollup, because in
+ * the combining stage each tuple belongs to exactly one grouping set.
+ * In this case, we use final_rollups instead, in which each rollup has
+ * only one grouping set.
+ */
+ List *rollups = DO_AGGSPLIT_COMBINE(aggsplit) ? gd->final_rollups : gd->rollups;
/*
* If we're not being offered sorted input, then only consider plans that
@@ -4213,7 +4293,7 @@ consider_groupingsets_paths(PlannerInfo *root,
List *empty_sets_data = NIL;
List *empty_sets = NIL;
ListCell *lc;
- ListCell *l_start = list_head(gd->rollups);
+ ListCell *l_start = list_head(rollups);
AggStrategy strat = AGG_HASHED;
double hashsize;
double exclude_groups = 0.0;
@@ -4245,7 +4325,7 @@ consider_groupingsets_paths(PlannerInfo *root,
{
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
- l_start = lnext(gd->rollups, l_start);
+ l_start = lnext(rollups, l_start);
}
hashsize = estimate_hashagg_tablesize(path,
@@ -4253,11 +4333,11 @@ consider_groupingsets_paths(PlannerInfo *root,
dNumGroups - exclude_groups);
/*
- * gd->rollups is empty if we have only unsortable columns to work
+ * rollups is empty if we have only unsortable columns to work
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
+ if (hashsize > work_mem * 1024L && rollups)
return; /* nope, won't fit */
/*
@@ -4266,7 +4346,7 @@ consider_groupingsets_paths(PlannerInfo *root,
*/
sets_data = list_copy(gd->unsortable_sets);
- for_each_cell(lc, gd->rollups, l_start)
+ for_each_cell(lc, rollups, l_start)
{
RollupData *rollup = lfirst_node(RollupData, lc);
@@ -4334,34 +4414,60 @@ consider_groupingsets_paths(PlannerInfo *root,
}
else if (empty_sets)
{
- RollupData *rollup = makeNode(RollupData);
+ /*
+ * If we are combining, each empty set becomes its own rollup;
+ * otherwise, all empty sets are put into one rollup.
+ */
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ {
+ ListCell *lc2;
+ forboth(lc, empty_sets, lc2, empty_sets_data)
+ {
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc2);
+ RollupData *rollup = makeNode(RollupData);
+
+ rollup->groupClause = NIL;
+ rollup->gsets_data = list_make1(gs);
+ rollup->gsets = list_make1(NIL);
+ rollup->numGroups = 1;
+ rollup->hashable = false;
+ rollup->is_hashed = false;
+ new_rollups = lappend(new_rollups, rollup);
+ }
+ }
+ else
+ {
+ RollupData *rollup = makeNode(RollupData);
+
+ rollup->groupClause = NIL;
+ rollup->gsets_data = empty_sets_data;
+ rollup->gsets = empty_sets;
+ rollup->numGroups = list_length(empty_sets);
+ rollup->hashable = false;
+ rollup->is_hashed = false;
+ new_rollups = lappend(new_rollups, rollup);
+ }
- rollup->groupClause = NIL;
- rollup->gsets_data = empty_sets_data;
- rollup->gsets = empty_sets;
- rollup->numGroups = list_length(empty_sets);
- rollup->hashable = false;
- rollup->is_hashed = false;
- new_rollups = lappend(new_rollups, rollup);
strat = AGG_MIXED;
}
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- strat,
- new_rollups,
- agg_costs,
- dNumGroups));
+ add_path_cb(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ strat,
+ new_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit));
return;
}
/*
* If we have sorted input but nothing we can do with it, bail.
*/
- if (list_length(gd->rollups) == 0)
+ if (list_length(rollups) == 0)
return;
/*
@@ -4374,7 +4480,7 @@ consider_groupingsets_paths(PlannerInfo *root,
*/
if (can_hash && gd->any_hashable)
{
- List *rollups = NIL;
+ List *mixed_rollups = NIL;
List *hash_sets = list_copy(gd->unsortable_sets);
double availspace = (work_mem * 1024.0);
ListCell *lc;
@@ -4386,10 +4492,10 @@ consider_groupingsets_paths(PlannerInfo *root,
agg_costs,
gd->dNumHashGroups);
- if (availspace > 0 && list_length(gd->rollups) > 1)
+ if (availspace > 0 && list_length(rollups) > 1)
{
double scale;
- int num_rollups = list_length(gd->rollups);
+ int num_rollups = list_length(rollups);
int k_capacity;
int *k_weights = palloc(num_rollups * sizeof(int));
Bitmapset *hash_items = NULL;
@@ -4427,11 +4533,13 @@ consider_groupingsets_paths(PlannerInfo *root,
* below, must use the same condition.
*/
i = 0;
- for_each_cell(lc, gd->rollups, list_second_cell(gd->rollups))
+ for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst_node(RollupData, lc);
- if (rollup->hashable)
+ /* An empty set cannot be hashed either */
+ if (rollup->hashable &&
+ list_length(linitial(rollup->gsets)) != 0)
{
double sz = estimate_hashagg_tablesize(path,
agg_costs,
@@ -4458,30 +4566,31 @@ consider_groupingsets_paths(PlannerInfo *root,
if (!bms_is_empty(hash_items))
{
- rollups = list_make1(linitial(gd->rollups));
+ mixed_rollups = list_make1(linitial(rollups));
i = 0;
- for_each_cell(lc, gd->rollups, list_second_cell(gd->rollups))
+ for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst_node(RollupData, lc);
- if (rollup->hashable)
+ if (rollup->hashable &&
+ list_length(linitial(rollup->gsets)) != 0)
{
if (bms_is_member(i, hash_items))
hash_sets = list_concat(hash_sets,
rollup->gsets_data);
else
- rollups = lappend(rollups, rollup);
+ mixed_rollups = lappend(mixed_rollups, rollup);
++i;
}
else
- rollups = lappend(rollups, rollup);
+ mixed_rollups = lappend(mixed_rollups, rollup);
}
}
}
- if (!rollups && hash_sets)
- rollups = list_copy(gd->rollups);
+ if (!mixed_rollups && hash_sets)
+ mixed_rollups = list_copy(rollups);
foreach(lc, hash_sets)
{
@@ -4498,20 +4607,21 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = gs->numGroups;
rollup->hashable = true;
rollup->is_hashed = true;
- rollups = lcons(rollup, rollups);
+ mixed_rollups = lcons(rollup, mixed_rollups);
}
- if (rollups)
+ if (mixed_rollups)
{
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_MIXED,
- rollups,
- agg_costs,
- dNumGroups));
+ add_path_cb(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_MIXED,
+ mixed_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit));
}
}
@@ -4519,15 +4629,16 @@ consider_groupingsets_paths(PlannerInfo *root,
* Now try the simple sorted case.
*/
if (!gd->unsortable_sets)
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_SORTED,
- gd->rollups,
- agg_costs,
- dNumGroups));
+ add_path_cb(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_SORTED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit));
}
/*
@@ -5242,6 +5353,13 @@ make_partial_grouping_target(PlannerInfo *root,
add_new_columns_to_pathtarget(partial_target, non_group_exprs);
+ /* Add the grouping set id so the combining stage can identify each tuple's grouping set */
+ if (parse->groupingSets)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ add_new_column_to_pathtarget(partial_target, (Expr *)expr);
+ }
+
/*
* Adjust Aggrefs to put them in partial mode. At this point all Aggrefs
* are at the top level of the target list, so we can just scan the list
@@ -6412,7 +6530,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
consider_groupingsets_paths(root, grouped_rel,
path, true, can_hash,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_SIMPLE);
}
else if (parse->hasAggs)
{
@@ -6479,7 +6599,15 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
-1.0);
}
- if (parse->hasAggs)
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, grouped_rel,
+ path, true, can_hash,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_FINAL_DESERIAL);
+ }
+ else if (parse->hasAggs)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6514,7 +6642,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_SIMPLE);
}
else
{
@@ -6557,22 +6687,37 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = partially_grouped_rel->cheapest_total_path;
- hashaggtablesize = estimate_hashagg_tablesize(path,
- agg_final_costs,
- dNumGroups);
+ if (parse->groupingSets)
+ {
+ /*
+ * Try for a hash-only groupingsets path over unsorted input.
+ */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_FINAL_DESERIAL);
+ }
+ else
+ {
- if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ hashaggtablesize = estimate_hashagg_tablesize(path,
+ agg_final_costs,
+ dNumGroups);
+
+ if (hashaggtablesize < work_mem * 1024L)
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6789,8 +6934,16 @@ create_partial_grouping_paths(PlannerInfo *root,
path,
root->group_pathkeys,
-1.0);
-
- if (parse->hasAggs)
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ path, true, can_hash,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGGSPLIT_INITIAL_SERIAL);
+ }
+ else if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
create_agg_path(root,
partially_grouped_rel,
@@ -6851,26 +7004,39 @@ create_partial_grouping_paths(PlannerInfo *root,
{
double hashaggtablesize;
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_partial_path,
- agg_partial_costs,
- dNumPartialPartialGroups);
-
- /* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
- cheapest_partial_path != NULL)
+ if (parse->groupingSets)
{
- add_partial_path(partially_grouped_rel, (Path *)
- create_agg_path(root,
- partially_grouped_rel,
- cheapest_partial_path,
- partially_grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_INITIAL_SERIAL,
- parse->groupClause,
- NIL,
- agg_partial_costs,
- dNumPartialPartialGroups));
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ cheapest_partial_path,
+ false, true,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGGSPLIT_INITIAL_SERIAL);
+ }
+ else
+ {
+ hashaggtablesize =
+ estimate_hashagg_tablesize(cheapest_partial_path,
+ agg_partial_costs,
+ dNumPartialPartialGroups);
+
+ /* Do the same for partial paths. */
+ if (hashaggtablesize < work_mem * 1024L &&
+ cheapest_partial_path != NULL)
+ {
+ add_partial_path(partially_grouped_rel, (Path *)
+ create_agg_path(root,
+ partially_grouped_rel,
+ cheapest_partial_path,
+ partially_grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ parse->groupClause,
+ NIL,
+ agg_partial_costs,
+ dNumPartialPartialGroups));
+ }
}
}
@@ -6913,6 +7079,9 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
/* Try Gather for unordered paths and Gather Merge for ordered ones. */
generate_gather_paths(root, rel, true);
+ if (root->parse->groupingSets)
+ return;
+
/* Try cheapest partial path + explicit Sort + Gather Merge. */
cheapest_partial_path = linitial(rel->partial_pathlist);
if (!pathkeys_contained_in(root->group_pathkeys,
@@ -6958,11 +7127,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 566ee96..d8723b9 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -728,6 +728,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
plan->qual = (List *)
convert_combining_aggrefs((Node *) plan->qual,
NULL);
+
+ /*
+ * If this node is combining a partial grouping sets aggregation,
+ * we must add a reference to the GroupingSetId expression in
+ * the targetlist of the child plan node.
+ */
+ if (agg->rollup)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ indexed_tlist *subplan_itlist = build_tlist_index(plan->lefttree->targetlist);
+
+ agg->grpSetIdFilter = fix_upper_expr(root, (Node *)expr,
+ subplan_itlist,
+ OUTER_VAR,
+ rtoffset);
+ }
}
set_upper_references(root, plan, rtoffset);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 34acb73..578ad60 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2992,7 +2992,8 @@ create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups)
+ double numGroups,
+ AggSplit aggsplit)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
PathTarget *target = rel->reltarget;
@@ -3010,6 +3011,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->aggsplit = aggsplit;
/*
* Simplify callers by downgrading AGG_SORTED to AGG_PLAIN, and AGG_MIXED
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 3e64390..f3e5766 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -7874,6 +7874,12 @@ get_rule_expr(Node *node, deparse_context *context,
}
break;
+ case T_GroupingSetId:
+ {
+ appendStringInfoString(buf, "GROUPINGSETID()");
+ }
+ break;
+
case T_WindowFunc:
get_windowfunc_expr((WindowFunc *) node, context);
break;
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index d21dbead..1361955 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -216,6 +216,7 @@ typedef enum ExprEvalOp
EEOP_XMLEXPR,
EEOP_AGGREF,
EEOP_GROUPING_FUNC,
+ EEOP_GROUPING_SET_ID,
EEOP_WINDOW_FUNC,
EEOP_SUBPLAN,
EEOP_ALTERNATIVE_SUBPLAN,
@@ -226,6 +227,7 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_PERHASH_NULL_CHECK,
EEOP_AGG_STRICT_TRANS_CHECK,
EEOP_AGG_PLAIN_TRANS_BYVAL,
EEOP_AGG_PLAIN_TRANS,
@@ -573,6 +575,12 @@ typedef struct ExprEvalStep
List *clauses; /* integer list of column numbers */
} grouping_func;
+ /* for EEOP_GROUPING_SET_ID */
+ struct
+ {
+ AggState *parent; /* parent Agg */
+ } grouping_set_id;
+
/* for EEOP_WINDOW_FUNC */
struct
{
@@ -634,6 +642,17 @@ typedef struct ExprEvalStep
int jumpnull;
} agg_init_trans;
+ /* for EEOP_AGG_PERHASH_NULL_CHECK */
+ struct
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ int setno;
+ int transno;
+ int setoff;
+ int jumpnull;
+ } agg_perhash_null_check;
+
/* for EEOP_AGG_STRICT_TRANS_CHECK */
struct
{
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 1a8ca98..4e5ec06 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -280,6 +280,11 @@ typedef struct AggStatePerPhaseData
Sort *sortnode; /* Sort node for input ordering for phase */
ExprState *evaltrans; /* evaluation of transition functions */
+
+ /* fields for parallel grouping sets */
+ int *grpsetids;
+ Tuplesortstate *sort_in; /* sorted input to phases > 1 */
+ Tuplestorestate *store_in; /* buffered input to phases > 1 */
} AggStatePerPhaseData;
/*
@@ -302,8 +307,10 @@ typedef struct AggStatePerHashData
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
Agg *aggnode; /* original Agg node, for numGroups etc. */
-} AggStatePerHashData;
+ /* field for parallel grouping sets */
+ int grpsetid;
+} AggStatePerHashData;
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 063b490..0ba408e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1972,6 +1972,13 @@ typedef struct GroupState
* expressions and run the aggregate transition functions.
* ---------------------
*/
+/* mapping from grouping set id to perphase or perhash data */
+typedef struct GrpSetMapping
+{
+ bool is_hashed;
+ int index; /* index into aggstate->perhash[] or aggstate->phases[] */
+} GrpSetMapping;
+
/* these structs are private in nodeAgg.c: */
typedef struct AggStatePerAggData *AggStatePerAgg;
typedef struct AggStatePerTransData *AggStatePerTrans;
@@ -2013,6 +2020,7 @@ typedef struct AggState
Tuplesortstate *sort_in; /* sorted input to phases > 1 */
Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
+ Tuplestorestate *store_in; /* buffered input to phases > 1 */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
AggStatePerGroup *pergroups; /* grouping set indexed array of per-group
* pointers */
@@ -2029,6 +2037,12 @@ typedef struct AggState
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
+
+ /* support for parallel grouping sets */
+ bool input_dispatched;
+ ExprState *grpsetid_filter; /* filter to fetch grouping set id
+ from child targetlist */
+ struct GrpSetMapping *grpSetMappings; /* grpsetid <-> perhash or perphase data */
} AggState;
/* ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 3cbb08d..9594201 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -153,6 +153,7 @@ typedef enum NodeTag
T_Param,
T_Aggref,
T_GroupingFunc,
+ T_GroupingSetId,
T_WindowFunc,
T_SubscriptingRef,
T_FuncExpr,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 13b147d..6d6fc55 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1674,6 +1674,7 @@ typedef struct GroupingSetData
{
NodeTag type;
List *set; /* grouping set as list of sortgrouprefs */
+ int grpsetId; /* unique grouping set identifier */
double numGroups; /* est. number of result groups */
} GroupingSetData;
@@ -1699,6 +1700,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ AggSplit aggsplit;
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e..74e8fb5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -20,6 +20,7 @@
#include "nodes/bitmapset.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
+#include "nodes/pathnodes.h"
/* ----------------------------------------------------------------
@@ -811,8 +812,9 @@ typedef struct Agg
long numGroups; /* estimated number of groups in input */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
- List *groupingSets; /* grouping sets to use */
+ RollupData *rollup; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Node *grpSetIdFilter;
} Agg;
/* ----------------
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 860a84d..e96cbfc 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -350,6 +350,12 @@ typedef struct GroupingFunc
int location; /* token location */
} GroupingFunc;
+/*
+ * GroupingSetId - expression node that returns the id of the grouping
+ * set the current input tuple belongs to
+ */
+typedef struct GroupingSetId
+{
+ Expr xpr;
+} GroupingSetId;
+
/*
* WindowFunc
*/
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54..900070b 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,7 +217,8 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups);
+ double numGroups,
+ AggSplit aggsplit);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
PathTarget *target,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index e7aaddd..b28476f 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,7 +54,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
+ RollupData *rollup, List *chain,
double dNumGroups, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
--
1.8.3.1
Hi Hackers,
Richard pointed out that he got incorrect results with the patch I
attached; there were bugs somewhere. I have now fixed them and attached
the newest version; please refer to [1] for the fix.
Thanks,
Pengzhou
References:
[1]: https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3
On Mon, Sep 30, 2019 at 5:41 PM Pengzhou Tang <ptang@pivotal.io> wrote:
Hi Richard & Tomas:
I followed the idea of the second approach: add a gset_id to the
targetlist of the first stage of grouping sets aggregation and use it to
combine the aggregates in the final stage. The gset_id stuff is still
kept because GROUPING() cannot uniquely identify a grouping set;
grouping sets may contain duplicated sets, e.g.
group by grouping sets((c1, c2), (c1, c2)).

There are some differences in implementing the second approach compared
to Richard's original idea: gset_id is not used as an additional group
key in the final stage; instead, we use it to dispatch each input tuple
directly to its grouping set and then do the aggregation. One advantage
of this is that we can handle multiple rollups with better performance,
without an APPEND node.

The plan now looks like:
gpadmin=# explain select c1, c2 from gstest group by grouping
sets(rollup(c1, c2), rollup(c3));
QUERY PLAN
--------------------------------------------------------------------------------------------
Finalize MixedAggregate (cost=1000.00..73108.57 rows=8842 width=12)
Dispatched by: (GROUPINGSETID())
Hash Key: c1, c2
Hash Key: c1
Hash Key: c3
Group Key: ()
Group Key: ()
-> Gather (cost=1000.00..71551.48 rows=17684 width=16)
Workers Planned: 2
-> Partial MixedAggregate (cost=0.00..68783.08 rows=8842
width=16)
Hash Key: c1, c2
Hash Key: c1
Hash Key: c3
Group Key: ()
Group Key: ()
-> Parallel Seq Scan on gstest (cost=0.00..47861.33
rows=2083333 width=12)
(16 rows)

gpadmin=# set enable_hashagg to off;
gpadmin=# explain select c1, c2 from gstest group by grouping
sets(rollup(c1, c2), rollup(c3));
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Finalize GroupAggregate (cost=657730.66..663207.45 rows=8842 width=12)
Dispatched by: (GROUPINGSETID())
Group Key: c1, c2
Sort Key: c1
Group Key: c1
Group Key: ()
Group Key: ()
Sort Key: c3
Group Key: c3
-> Sort (cost=657730.66..657774.87 rows=17684 width=16)
Sort Key: c1, c2
-> Gather (cost=338722.94..656483.04 rows=17684 width=16)
Workers Planned: 2
-> Partial GroupAggregate (cost=337722.94..653714.64
rows=8842 width=16)
Group Key: c1, c2
Group Key: c1
Group Key: ()
Group Key: ()
Sort Key: c3
Group Key: c3
-> Sort (cost=337722.94..342931.28 rows=2083333
width=12)
Sort Key: c1, c2
-> Parallel Seq Scan on gstest
(cost=0.00..47861.33 rows=2083333 width=12)

References:
[1] https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3

On Wed, Jul 31, 2019 at 4:07 PM Richard Guo <riguo@pivotal.io> wrote:
On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <riguo@pivotal.io> wrote:
Hi all,
Paul and I have been hacking recently to implement parallel grouping
sets, and here we have two implementations.

Implementation 1
================

Attached is the patch, and there is also a github branch [1] for this
work.

Rebased with the latest master.
Hi Richard,
thanks for the rebased patch. I think the patch is mostly fine (at least
I don't see any serious issues). A couple minor comments:

Hi Tomas,
Thank you for reviewing this patch.
1) I think get_number_of_groups() would deserve a short explanation why
it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
same branch. Before, these cases were clearly separated; now it seems a
bit mixed up and it may not be immediately obvious why it's OK.

Added a short comment in get_number_of_groups() explaining the behavior
when doing partial aggregation for grouping sets.

2) There are new regression tests, but they are not added to any schedule
(parallel or serial), and so are not executed as part of "make check". I
suppose this is a mistake.

Yes, thanks. Added the new regression tests to parallel_schedule and
serial_schedule.

3) The regression tests do check plan and results like this:
EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
SELECT ... ORDER BY 1, 2, 3;

which however means that the query might easily use a different plan
than what's verified in the explain (thanks to the additional ORDER BY
clause). So I think this should explain and execute the same query.

(In this case the plans seem to be the same, but that may easily change
in the future, and we could miss it here, failing to verify the results.)

Thank you for pointing this out. Fixed it in the V4 patch.
4) It might be a good idea to check the negative case too, i.e. a query
on a data set that we should not parallelize (because the number of
partial groups would be too high).

Yes, agreed. Added a negative case.
Do you have any plans to hack on the second approach too? AFAICS those
two approaches are complementary (they address different data sets /
queries), and it would be nice to have both. One of the things I've been
wondering is whether we need to invent gset_id as a new concept, or if
we could simply use the existing GROUPING() function - that uniquely
identifies the grouping set.

Yes, I'm planning to hack on the second approach in the near future. I'm
also reconsidering the gset_id stuff, since it brings a lot of
complexity for the second approach. I agree with you that we can try the
GROUPING() function to see if it can replace gset_id.

Thanks
Richard
Attachments:
v2-0001-Support-for-parallel-grouping-sets.patch
From 50511ff75b680fcee27ffe2a8824b0686ed2d1db Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Tue, 24 Sep 2019 04:22:42 -0400
Subject: [PATCH] Support for parallel grouping sets
We used to support grouping sets in one worker only; this patch
adds support for parallel grouping sets across multiple workers.
In the first stage, the partial aggregates are performed by
multiple workers; each worker performs the aggregates on all
grouping sets. Meanwhile, a grouping set id is attached to the
tuples of the first stage to identify which grouping set each
tuple belongs to. In the final stage, the gathered tuples are
dispatched to the specified grouping set according to the
additional set id, and the combine aggregates are then performed
per grouping set. We don't use the GROUPING() function to
identify the grouping set because a grouping sets clause may
contain duplicate sets.
Some changes are also made in the executor for the final stage:

For the AGG_HASHED strategy, all grouping sets still perform
the combine aggregates in phase 0; the only difference is that
only one group is selected per tuple in the final stage, so we
need to skip the unselected groups.

For the AGG_MIXED strategy, phase 0 now also needs to do its
own aggregation.

For the AGG_SORTED strategy, the rollup will be expanded, e.g.
rollup(<c1, c2>, <c1>, <>) is expanded to three rollups:
rollup(<c1, c2>), rollup(<c1>) and rollup(<>), so tuples can
be dispatched to those three phases and aggregated there.
---
src/backend/commands/explain.c | 10 +-
src/backend/executor/execExpr.c | 42 +++-
src/backend/executor/execExprInterp.c | 34 +++
src/backend/executor/nodeAgg.c | 330 +++++++++++++++++++++++++---
src/backend/nodes/copyfuncs.c | 55 ++++-
src/backend/nodes/equalfuncs.c | 3 +
src/backend/nodes/nodeFuncs.c | 8 +
src/backend/nodes/outfuncs.c | 13 +-
src/backend/nodes/readfuncs.c | 52 ++++-
src/backend/optimizer/path/allpaths.c | 3 +
src/backend/optimizer/plan/createplan.c | 18 +-
src/backend/optimizer/plan/planner.c | 376 +++++++++++++++++++++++---------
src/backend/optimizer/plan/setrefs.c | 16 ++
src/backend/optimizer/util/pathnode.c | 4 +-
src/backend/utils/adt/ruleutils.c | 6 +
src/include/executor/execExpr.h | 19 ++
src/include/executor/nodeAgg.h | 9 +-
src/include/nodes/execnodes.h | 14 ++
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 2 +
src/include/nodes/plannodes.h | 4 +-
src/include/nodes/primnodes.h | 6 +
src/include/optimizer/pathnode.h | 3 +-
src/include/optimizer/planmain.h | 2 +-
24 files changed, 869 insertions(+), 161 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 62fb343..f1a2e21 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2206,12 +2206,16 @@ show_agg_keys(AggState *astate, List *ancestors,
{
Agg *plan = (Agg *) astate->ss.ps.plan;
- if (plan->numCols > 0 || plan->groupingSets)
+ if (plan->grpSetIdFilter)
+ show_expression(plan->grpSetIdFilter, "Dispatched by",
+ astate, ancestors, true, es);
+
+ if (plan->numCols > 0 || plan->rollup)
{
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(astate, ancestors);
- if (plan->groupingSets)
+ if (plan->rollup)
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
@@ -2263,7 +2267,7 @@ show_grouping_set_keys(PlanState *planstate,
Plan *plan = planstate->plan;
char *exprstr;
ListCell *lc;
- List *gsets = aggnode->groupingSets;
+ List *gsets = aggnode->rollup->gsets;
AttrNumber *keycols = aggnode->grpColIdx;
const char *keyname;
const char *keysetname;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 6d09f2a..27c8cd9 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -813,7 +813,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
agg = (Agg *) (state->parent->plan);
- if (agg->groupingSets)
+ if (agg->rollup)
scratch.d.grouping_func.clauses = grp_node->cols;
else
scratch.d.grouping_func.clauses = NIL;
@@ -822,6 +822,15 @@ ExecInitExprRec(Expr *node, ExprState *state,
break;
}
+ case T_GroupingSetId:
+ {
+ scratch.opcode = EEOP_GROUPING_SET_ID;
+ scratch.d.grouping_set_id.parent = (AggState *) state->parent;
+
+ ExprEvalPushStep(state, &scratch);
+ break;
+ }
+
case T_WindowFunc:
{
WindowFunc *wfunc = (WindowFunc *) node;
@@ -3214,6 +3223,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
+ int adjust_perhash_jumpnull = -1;
ExprContext *aggcontext;
if (ishash)
@@ -3246,6 +3256,30 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
adjust_init_jumpnull = state->steps_len - 1;
}
+ /*
+ * All grouping sets that use AGG_HASHED are sent to
+ * phase zero. When combining the partial aggregate
+ * results, only one group is selected for each tuple,
+ * so we need to add one more check step to skip the
+ * unselected groups.
+ */
+ if (ishash && aggstate->grpsetid_filter &&
+ DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ scratch->opcode = EEOP_AGG_PERHASH_NULL_CHECK;
+ scratch->d.agg_perhash_null_check.aggstate = aggstate;
+ scratch->d.agg_perhash_null_check.setno = setno;
+ scratch->d.agg_perhash_null_check.setoff = setoff;
+ scratch->d.agg_perhash_null_check.transno = transno;
+ scratch->d.agg_perhash_null_check.jumpnull = -1; /* adjust later */
+ ExprEvalPushStep(state, scratch);
+
+ /*
+ * Note, we don't push into adjust_bailout here - those jump to the
+ */
+ adjust_perhash_jumpnull = state->steps_len - 1;
+ }
+
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
@@ -3291,6 +3325,12 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
Assert(as->d.agg_init_trans.jumpnull == -1);
as->d.agg_init_trans.jumpnull = state->steps_len;
}
+ if (adjust_perhash_jumpnull != -1)
+ {
+ ExprEvalStep *as = &state->steps[adjust_perhash_jumpnull];
+ Assert(as->d.agg_perhash_null_check.jumpnull == -1);
+ as->d.agg_perhash_null_check.jumpnull = state->steps_len;
+ }
if (adjust_strict_jumpnull != -1)
{
ExprEvalStep *as = &state->steps[adjust_strict_jumpnull];
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 66a67c7..0895ad7 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -382,6 +382,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_XMLEXPR,
&&CASE_EEOP_AGGREF,
&&CASE_EEOP_GROUPING_FUNC,
+ &&CASE_EEOP_GROUPING_SET_ID,
&&CASE_EEOP_WINDOW_FUNC,
&&CASE_EEOP_SUBPLAN,
&&CASE_EEOP_ALTERNATIVE_SUBPLAN,
@@ -390,6 +391,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_PERHASH_NULL_CHECK,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
&&CASE_EEOP_AGG_PLAIN_TRANS,
@@ -1463,6 +1465,21 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_GROUPING_SET_ID)
+ {
+ int grpsetid;
+ AggState *aggstate = (AggState *) op->d.grouping_set_id.parent;
+
+ if (aggstate->current_phase == 0)
+ grpsetid = aggstate->perhash[aggstate->current_set].grpsetid;
+ else
+ grpsetid = aggstate->phase->grpsetids[aggstate->current_set];
+
+ *op->resvalue = grpsetid;
+ *op->resnull = false;
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_WINDOW_FUNC)
{
/*
@@ -1586,6 +1603,23 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PERHASH_NULL_CHECK)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+
+ aggstate = op->d.agg_perhash_null_check.aggstate;
+ pergroup = &aggstate->all_pergroups
+ [op->d.agg_perhash_null_check.setoff]
+ [op->d.agg_perhash_null_check.transno];
+
+ /* Skip this group if it was not selected for the current tuple. */
+ if (!pergroup)
+ EEO_JUMP(op->d.agg_perhash_null_check.jumpnull);
+
+ EEO_NEXT();
+ }
+
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a9a1fd0..7fc1cf8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -226,6 +226,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
@@ -275,6 +276,7 @@ static void build_hash_table(AggState *aggstate);
static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
+static void agg_dispatch_input_tuples(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
@@ -313,9 +315,6 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
/*
* Switch to phase "newphase", which must either be 0 or 1 (to reset) or
* current_phase + 1. Juggle the tuplesorts accordingly.
- *
- * Phase 0 is for hashing, which we currently handle last in the AGG_MIXED
- * case, so when entering phase 0, all we need to do is drop open sorts.
*/
static void
initialize_phase(AggState *aggstate, int newphase)
@@ -332,6 +331,12 @@ initialize_phase(AggState *aggstate, int newphase)
aggstate->sort_in = NULL;
}
+ if (aggstate->store_in)
+ {
+ tuplestore_end(aggstate->store_in);
+ aggstate->store_in = NULL;
+ }
+
if (newphase <= 1)
{
/*
@@ -345,21 +350,36 @@ initialize_phase(AggState *aggstate, int newphase)
}
else
{
- /*
- * The old output tuplesort becomes the new input one, and this is the
- * right time to actually sort it.
+ /*
+ * When combining partial grouping sets aggregate results, we use
+ * the sort_in or store_in which contains the dispatched tuples as
+ * the input. Otherwise, use the sort_out of the previous phase.
*/
- aggstate->sort_in = aggstate->sort_out;
+ if (DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ aggstate->sort_in = aggstate->phases[newphase].sort_in;
+ aggstate->store_in = aggstate->phases[newphase].store_in;
+ }
+ else
+ {
+ aggstate->sort_in = aggstate->sort_out;
+ aggstate->store_in = NULL;
+ }
+
aggstate->sort_out = NULL;
- Assert(aggstate->sort_in);
- tuplesort_performsort(aggstate->sort_in);
+ Assert(aggstate->sort_in || aggstate->store_in);
+
+ /* This is the right time to actually sort it. */
+ if (aggstate->sort_in)
+ tuplesort_performsort(aggstate->sort_in);
}
/*
* If this isn't the last phase, we need to sort appropriately for the
* next phase in sequence.
*/
- if (newphase > 0 && newphase < aggstate->numphases - 1)
+ if (aggstate->aggsplit != AGGSPLIT_FINAL_DESERIAL &&
+ newphase > 0 && newphase < aggstate->numphases - 1)
{
Sort *sortnode = aggstate->phases[newphase + 1].sortnode;
PlanState *outerNode = outerPlanState(aggstate);
@@ -401,6 +421,15 @@ fetch_input_tuple(AggState *aggstate)
return NULL;
slot = aggstate->sort_slot;
}
+ else if (aggstate->store_in)
+ {
+ /* make sure we check for interrupts in either path through here */
+ CHECK_FOR_INTERRUPTS();
+ if (!tuplestore_gettupleslot(aggstate->store_in, true, false,
+ aggstate->sort_slot))
+ return NULL;
+ slot = aggstate->sort_slot;
+ }
else
slot = ExecProcNode(outerPlanState(aggstate));
@@ -1527,6 +1556,33 @@ lookup_hash_entries(AggState *aggstate)
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
+ if (aggstate->grpsetid_filter)
+ {
+ bool dummynull;
+ int grpsetid = ExecEvalExprSwitchContext(aggstate->grpsetid_filter,
+ aggstate->tmpcontext,
+ &dummynull);
+ GrpSetMapping *mapping = &aggstate->grpSetMappings[grpsetid];
+
+ if (!mapping)
+ return;
+
+ for (setno = 0; setno < numHashes; setno++)
+ {
+ if (setno == mapping->index)
+ {
+ select_current_set(aggstate, setno, true);
+ pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ }
+ else
+ {
+ pergroup[setno] = NULL;
+ }
+ }
+
+ return;
+ }
+
for (setno = 0; setno < numHashes; setno++)
{
select_current_set(aggstate, setno, true);
@@ -1569,6 +1625,9 @@ ExecAgg(PlanState *pstate)
break;
case AGG_PLAIN:
case AGG_SORTED:
+ if (node->grpsetid_filter && !node->input_dispatched)
+ agg_dispatch_input_tuples(node);
+
result = agg_retrieve_direct(node);
break;
}
@@ -1680,10 +1739,20 @@ agg_retrieve_direct(AggState *aggstate)
else if (aggstate->aggstrategy == AGG_MIXED)
{
/*
- * Mixed mode; we've output all the grouped stuff and have
- * full hashtables, so switch to outputting those.
+ * Mixed mode; in the non-combine case, we've output all the
+ * grouped stuff and have full hashtables, so switch to
+ * outputting those. In the combine case, phase one does not
+ * do this, so we need to fill the hash tables ourselves.
*/
initialize_phase(aggstate, 0);
+
+ if (DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ /* use the store_in which contains the dispatched tuples */
+ aggstate->store_in = aggstate->phase->store_in;
+ agg_fill_hash_table(aggstate);
+ }
+
aggstate->table_filled = true;
ResetTupleHashIterator(aggstate->perhash[0].hashtable,
&aggstate->perhash[0].hashiter);
@@ -1838,7 +1907,8 @@ agg_retrieve_direct(AggState *aggstate)
* hashtables as well in advance_aggregates.
*/
if (aggstate->aggstrategy == AGG_MIXED &&
- aggstate->current_phase == 1)
+ aggstate->current_phase == 1 &&
+ !aggstate->grpsetid_filter)
{
lookup_hash_entries(aggstate);
}
@@ -1921,6 +1991,122 @@ agg_retrieve_direct(AggState *aggstate)
}
/*
+ * ExecAgg for parallel grouping sets:
+ *
+ * When combining the partial grouping sets aggregate results from
+ * workers, the input contains mixed tuples from different grouping
+ * sets. To avoid unnecessary work, the tuples are pre-dispatched to
+ * the corresponding phases directly.
+ *
+ * This function must be called in phase one, which is AGG_SORTED or
+ * AGG_PLAIN.
+ */
+static void
+agg_dispatch_input_tuples(AggState *aggstate)
+{
+ int grpsetid;
+ int phase;
+ bool isNull;
+ PlanState *saved_sort;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ GrpSetMapping *mapping;
+ TupleTableSlot *outerslot;
+ AggStatePerPhase perphase;
+
+ /* prepare tuplestore or tuplesort for each phase */
+ for (phase = 0; phase < aggstate->numphases; phase++)
+ {
+ perphase = &aggstate->phases[phase];
+
+ if (!perphase->aggnode)
+ continue;
+
+ if (perphase->aggstrategy == AGG_SORTED)
+ {
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+ Sort *sortnode = (Sort *) perphase->aggnode->plan.lefttree;
+
+ Assert(perphase->aggstrategy == AGG_SORTED);
+
+ perphase->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ }
+ else
+ perphase->store_in = tuplestore_begin_heap(false, false, work_mem);
+ }
+
+ /*
+ * If phase one is AGG_SORTED, we cannot run the sort node beneath it
+ * directly because its input comes from different grouping sets; we
+ * need to dispatch the tuples first and then do the sort.
+ *
+ * To do this, we replace the outerPlan of the current Agg node with
+ * the child node of the sort node.
+ *
+ * This is unnecessary for AGG_PLAIN.
+ */
+ if (aggstate->phase->aggstrategy == AGG_SORTED)
+ {
+ saved_sort = outerPlanState(aggstate);
+ outerPlanState(aggstate) = outerPlanState(outerPlanState(aggstate));
+ }
+
+ for (;;)
+ {
+ outerslot = fetch_input_tuple(aggstate);
+ if (TupIsNull(outerslot))
+ break;
+
+ /* set up for advance_aggregates */
+ tmpcontext->ecxt_outertuple = outerslot;
+ grpsetid = ExecEvalExprSwitchContext(aggstate->grpsetid_filter,
+ tmpcontext,
+ &isNull);
+
+ /* route the slot to the corresponding phase using its grouping set id */
+ mapping = &aggstate->grpSetMappings[grpsetid];
+ if (!mapping->is_hashed)
+ {
+ perphase = &aggstate->phases[mapping->index];
+
+ if (perphase->aggstrategy == AGG_SORTED)
+ tuplesort_puttupleslot(perphase->sort_in, outerslot);
+ else
+ tuplestore_puttupleslot(perphase->store_in, outerslot);
+ }
+ else
+ tuplestore_puttupleslot(aggstate->phases[0].store_in, outerslot);
+
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ /* Restore the outer plan and perform the sorting here. */
+ if (aggstate->phase->aggstrategy == AGG_SORTED)
+ {
+ outerPlanState(aggstate) = saved_sort;
+ tuplesort_performsort(aggstate->phase->sort_in);
+ }
+
+ /*
+ * Reinitialize phase one to use the store_in or sort_in that now
+ * contains the dispatched tuples.
+ */
+ aggstate->sort_in = aggstate->phase->sort_in;
+ aggstate->store_in = aggstate->phase->store_in;
+ select_current_set(aggstate, 0, false);
+
+ /* mark the input dispatched */
+ aggstate->input_dispatched = true;
+}
+
+/*
* ExecAgg for hashed case: read input and build hash table
*/
static void
@@ -2146,6 +2332,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->grp_firstTuple = NULL;
aggstate->sort_in = NULL;
aggstate->sort_out = NULL;
+ aggstate->input_dispatched = false;
/*
* phases[0] always exists, but is dummy in sorted/plain mode
@@ -2158,16 +2345,16 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* determines the size of some allocations. Also calculate the number of
* phases, since all hashed/mixed nodes contribute to only a single phase.
*/
- if (node->groupingSets)
+ if (node->rollup)
{
- numGroupingSets = list_length(node->groupingSets);
+ numGroupingSets = list_length(node->rollup->gsets);
foreach(l, node->chain)
{
Agg *agg = lfirst(l);
numGroupingSets = Max(numGroupingSets,
- list_length(agg->groupingSets));
+ list_length(agg->rollup->gsets));
/*
* additional AGG_HASHED aggs become part of phase 0, but all
@@ -2186,6 +2373,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
+ /*
+ * When combining the partial grouping sets aggregate results, we
+ * need a grpsetid mapping to find the corresponding perhash or
+ * perphase data.
+ */
+ if (DO_AGGSPLIT_COMBINE(node->aggsplit) && node->rollup)
+ aggstate->grpSetMappings = (GrpSetMapping *)
+ palloc0(sizeof(GrpSetMapping) * (numPhases + numHashes));
+
/*
* Create expression contexts. We need three or more, one for
* per-input-tuple processing, one for per-output-tuple processing, one
@@ -2243,8 +2439,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
/*
* If there are more than two phases (including a potential dummy phase
* 0), input will be resorted using tuplesort. Need a slot for that.
+ *
+ * Alternatively, if we are combining the partial grouping sets aggregate
+ * results, input belonging to an AGG_HASHED rollup will use a tuplestore.
+ * Need a slot for that too.
*/
- if (numPhases > 2)
+ if (numPhases > 2 ||
+ (DO_AGGSPLIT_COMBINE(node->aggsplit) &&
+ node->aggstrategy == AGG_MIXED))
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -2291,6 +2492,14 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
ExecInitQual(node->plan.qual, (PlanState *) aggstate);
/*
+ * Initialize the grouping set id expression, which identifies the
+ * grouping set an input tuple belongs to when combining partial
+ * grouping sets aggregate results.
+ */
+ aggstate->grpsetid_filter = ExecInitExpr((Expr *) node->grpSetIdFilter,
+ (PlanState *)aggstate);
+
+ /*
* We should now have found all Aggrefs in the targetlist and quals.
*/
numaggs = aggstate->numaggs;
@@ -2348,6 +2557,21 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
/* but the actual Agg node representing this hash is saved here */
perhash->aggnode = aggnode;
+ if (aggnode->rollup)
+ {
+ GroupingSetData *gs =
+ linitial_node(GroupingSetData, aggnode->rollup->gsets_data);
+
+ perhash->grpsetid = gs->grpsetId;
+
+ /* add a mapping when combining */
+ if (DO_AGGSPLIT_COMBINE(aggnode->aggsplit))
+ {
+ aggstate->grpSetMappings[perhash->grpsetid].is_hashed = true;
+ aggstate->grpSetMappings[perhash->grpsetid].index = i;
+ }
+ }
+
phasedata->gset_lengths[i] = perhash->numCols = aggnode->numCols;
for (j = 0; j < aggnode->numCols; ++j)
@@ -2363,18 +2587,21 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
AggStatePerPhase phasedata = &aggstate->phases[++phase];
int num_sets;
- phasedata->numsets = num_sets = list_length(aggnode->groupingSets);
+ phasedata->numsets = num_sets = aggnode->rollup ?
+ list_length(aggnode->rollup->gsets) : 0;
if (num_sets)
{
phasedata->gset_lengths = palloc(num_sets * sizeof(int));
phasedata->grouped_cols = palloc(num_sets * sizeof(Bitmapset *));
+ phasedata->grpsetids = palloc(num_sets * sizeof(int));
i = 0;
- foreach(l, aggnode->groupingSets)
+ foreach(l, aggnode->rollup->gsets_data)
{
- int current_length = list_length(lfirst(l));
Bitmapset *cols = NULL;
+ GroupingSetData *gs = lfirst_node(GroupingSetData, l);
+ int current_length = list_length(gs->set);
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -2382,12 +2609,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->grouped_cols[i] = cols;
phasedata->gset_lengths[i] = current_length;
-
+ phasedata->grpsetids[i] = gs->grpsetId;
++i;
}
all_grouped_cols = bms_add_members(all_grouped_cols,
phasedata->grouped_cols[0]);
+
+ /* add a mapping when combining */
+ if (DO_AGGSPLIT_COMBINE(node->aggsplit))
+ {
+ aggstate->grpSetMappings[phasedata->grpsetids[0]].is_hashed = false;
+ aggstate->grpSetMappings[phasedata->grpsetids[0]].index = phase;
+ }
}
else
{
@@ -2871,23 +3105,50 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (!phase->aggnode)
continue;
- if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 1)
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ phaseidx == 1)
{
- /*
- * Phase one, and only phase one, in a mixed agg performs both
- * sorting and aggregation.
- */
- dohash = true;
- dosort = true;
+ if (!DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ /*
+ * Phase one, and only phase one, in a mixed agg performs both
+ * sorting and aggregation.
+ */
+ dohash = true;
+ dosort = true;
+ }
+ else
+ {
+ /*
+ * When combining partial grouping sets aggregate results, the
+ * input has been dispatched according to the grouping set id, so
+ * we cannot perform both sorted and hashed aggregation in one
+ * phase; perform only the sorted aggregation here.
+ */
+ dohash = false;
+ dosort = true;
+ }
}
else if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 0)
{
- /*
- * No need to compute a transition function for an AGG_MIXED phase
- * 0 - the contents of the hashtables will have been computed
- * during phase 1.
- */
- continue;
+ if (!DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ /*
+ * No need to compute a transition function for an AGG_MIXED phase
+ * 0 - the contents of the hashtables will have been computed
+ * during phase 1.
+ */
+ continue;
+ }
+ else
+ {
+ /*
+ * When combining partial grouping sets aggregate results,
+ * phase 0 needs to do its own hash aggregation.
+ */
+ dohash = true;
+ dosort = false;
+ }
}
else if (phase->aggstrategy == AGG_PLAIN ||
phase->aggstrategy == AGG_SORTED)
@@ -3440,6 +3701,7 @@ ExecReScanAgg(AggState *node)
int setno;
node->agg_done = false;
+ node->input_dispatched = false;
if (node->aggstrategy == AGG_HASHED)
{
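Before the node-support changes below, it may help to see the dispatch idea from agg_dispatch_input_tuples() in isolation. The following is a minimal, hypothetical sketch, not code from the patch: GrpSetMapping here is a simplified stand-in for the executor's mapping array, and dispatch_phase() is an invented helper. Tuples whose grouping set is hashed all share phase 0's tuplestore, while each sort-based grouping set is spooled into its own phase.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the per-grouping-set mapping in AggState */
typedef struct
{
	bool		is_hashed;	/* is this grouping set handled by hashing? */
	int			index;		/* phase index (sorted) or perhash index */
} GrpSetMapping;

/*
 * Return the phase whose spool (tuplesort or tuplestore) a tuple with
 * the given grouping set id should be written to.  All hashed grouping
 * sets share phase 0's tuplestore; sorted sets each get their own phase.
 */
int
dispatch_phase(const GrpSetMapping *mappings, int grpsetid)
{
	const GrpSetMapping *m = &mappings[grpsetid];

	return m->is_hashed ? 0 : m->index;
}
```

With three grouping sets, two sorted (phases 1 and 2) and one hashed, every hashed-set tuple lands in phase 0's shared store while the sorted-set tuples are routed to their dedicated phases.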
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index a2617c7..d3ec4b5 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -986,7 +986,7 @@ _copyAgg(const Agg *from)
}
COPY_SCALAR_FIELD(numGroups);
COPY_BITMAPSET_FIELD(aggParams);
- COPY_NODE_FIELD(groupingSets);
+ COPY_NODE_FIELD(rollup);
COPY_NODE_FIELD(chain);
return newnode;
@@ -1474,6 +1474,50 @@ _copyGroupingFunc(const GroupingFunc *from)
}
/*
+ * _copyGroupingSetId
+ */
+static GroupingSetId *
+_copyGroupingSetId(const GroupingSetId *from)
+{
+ GroupingSetId *newnode = makeNode(GroupingSetId);
+
+ return newnode;
+}
+
+/*
+ * _copyRollupData
+ */
+static RollupData*
+_copyRollupData(const RollupData *from)
+{
+ RollupData *newnode = makeNode(RollupData);
+
+ COPY_NODE_FIELD(groupClause);
+ COPY_NODE_FIELD(gsets);
+ COPY_NODE_FIELD(gsets_data);
+ COPY_SCALAR_FIELD(numGroups);
+ COPY_SCALAR_FIELD(hashable);
+ COPY_SCALAR_FIELD(is_hashed);
+
+ return newnode;
+}
+
+/*
+ * _copyGroupingSetData
+ */
+static GroupingSetData *
+_copyGroupingSetData(const GroupingSetData *from)
+{
+ GroupingSetData *newnode = makeNode(GroupingSetData);
+
+ COPY_NODE_FIELD(set);
+ COPY_SCALAR_FIELD(grpsetId);
+ COPY_SCALAR_FIELD(numGroups);
+
+ return newnode;
+}
+
+/*
* _copyWindowFunc
*/
static WindowFunc *
@@ -4938,6 +4982,9 @@ copyObjectImpl(const void *from)
case T_GroupingFunc:
retval = _copyGroupingFunc(from);
break;
+ case T_GroupingSetId:
+ retval = _copyGroupingSetId(from);
+ break;
case T_WindowFunc:
retval = _copyWindowFunc(from);
break;
@@ -5568,6 +5615,12 @@ copyObjectImpl(const void *from)
case T_SortGroupClause:
retval = _copySortGroupClause(from);
break;
+ case T_RollupData:
+ retval = _copyRollupData(from);
+ break;
+ case T_GroupingSetData:
+ retval = _copyGroupingSetData(from);
+ break;
case T_GroupingSet:
retval = _copyGroupingSet(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 4f2ebe5..dec6d4f 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3049,6 +3049,9 @@ equal(const void *a, const void *b)
case T_GroupingFunc:
retval = _equalGroupingFunc(a, b);
break;
+ case T_GroupingSetId:
+ retval = true;
+ break;
case T_WindowFunc:
retval = _equalWindowFunc(a, b);
break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index 18bd5ac..8dc702f 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -63,6 +63,9 @@ exprType(const Node *expr)
case T_GroupingFunc:
type = INT4OID;
break;
+ case T_GroupingSetId:
+ type = INT4OID;
+ break;
case T_WindowFunc:
type = ((const WindowFunc *) expr)->wintype;
break;
@@ -741,6 +744,9 @@ exprCollation(const Node *expr)
case T_GroupingFunc:
coll = InvalidOid;
break;
+ case T_GroupingSetId:
+ coll = InvalidOid;
+ break;
case T_WindowFunc:
coll = ((const WindowFunc *) expr)->wincollid;
break;
@@ -1870,6 +1876,7 @@ expression_tree_walker(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
/* primitive node types with no expression subnodes */
break;
case T_WithCheckOption:
@@ -2506,6 +2513,7 @@ expression_tree_mutator(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
return (Node *) copyObject(node);
case T_WithCheckOption:
{
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e6ce8e2..b3ff513 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -781,7 +781,7 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_OID_ARRAY(grpCollations, node->numCols);
WRITE_LONG_FIELD(numGroups);
WRITE_BITMAPSET_FIELD(aggParams);
- WRITE_NODE_FIELD(groupingSets);
+ WRITE_NODE_FIELD(rollup);
WRITE_NODE_FIELD(chain);
}
@@ -1146,6 +1146,13 @@ _outGroupingFunc(StringInfo str, const GroupingFunc *node)
}
static void
+_outGroupingSetId(StringInfo str,
+ const GroupingSetId *node pg_attribute_unused())
+{
+ WRITE_NODE_TYPE("GROUPINGSETID");
+}
+
+static void
_outWindowFunc(StringInfo str, const WindowFunc *node)
{
WRITE_NODE_TYPE("WINDOWFUNC");
@@ -1996,6 +2003,7 @@ _outGroupingSetData(StringInfo str, const GroupingSetData *node)
WRITE_NODE_TYPE("GSDATA");
WRITE_NODE_FIELD(set);
+ WRITE_INT_FIELD(grpsetId);
WRITE_FLOAT_FIELD(numGroups, "%.0f");
}
@@ -3824,6 +3832,9 @@ outNode(StringInfo str, const void *obj)
case T_GroupingFunc:
_outGroupingFunc(str, obj);
break;
+ case T_GroupingSetId:
+ _outGroupingSetId(str, obj);
+ break;
case T_WindowFunc:
_outWindowFunc(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 764e3bb..4f76957 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -637,6 +637,50 @@ _readGroupingFunc(void)
}
/*
+ * _readGroupingSetId
+ */
+static GroupingSetId *
+_readGroupingSetId(void)
+{
+ READ_LOCALS_NO_FIELDS(GroupingSetId);
+
+ READ_DONE();
+}
+
+/*
+ * _readRollupData
+ */
+static RollupData *
+_readRollupData(void)
+{
+ READ_LOCALS(RollupData);
+
+ READ_NODE_FIELD(groupClause);
+ READ_NODE_FIELD(gsets);
+ READ_NODE_FIELD(gsets_data);
+ READ_FLOAT_FIELD(numGroups);
+ READ_BOOL_FIELD(hashable);
+ READ_BOOL_FIELD(is_hashed);
+
+ READ_DONE();
+}
+
+/*
+ * _readGroupingSetData
+ */
+static GroupingSetData *
+_readGroupingSetData(void)
+{
+ READ_LOCALS(GroupingSetData);
+
+ READ_NODE_FIELD(set);
+ READ_INT_FIELD(grpsetId);
+ READ_FLOAT_FIELD(numGroups);
+
+ READ_DONE();
+}
+
+/*
* _readWindowFunc
*/
static WindowFunc *
@@ -2171,7 +2215,7 @@ _readAgg(void)
READ_OID_ARRAY(grpCollations, local_node->numCols);
READ_LONG_FIELD(numGroups);
READ_BITMAPSET_FIELD(aggParams);
- READ_NODE_FIELD(groupingSets);
+ READ_NODE_FIELD(rollup);
READ_NODE_FIELD(chain);
READ_DONE();
@@ -2607,6 +2651,12 @@ parseNodeString(void)
return_value = _readAggref();
else if (MATCH("GROUPINGFUNC", 12))
return_value = _readGroupingFunc();
+ else if (MATCH("GROUPINGSETID", 13))
+ return_value = _readGroupingSetId();
+ else if (MATCH("ROLLUP", 6))
+ return_value = _readRollupData();
+ else if (MATCH("GSDATA", 6))
+ return_value = _readGroupingSetData();
else if (MATCH("WINDOWFUNC", 10))
return_value = _readWindowFunc();
else if (MATCH("SUBSCRIPTINGREF", 15))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index db3a68a..a357f37 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2708,6 +2708,9 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
NULL, rowsp);
add_path(rel, simple_gather_path);
+ if (root->parse->groupingSets)
+ return;
+
/*
* For each useful ordering, we can consider an order-preserving Gather
* Merge.
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 0c03620..0d07c71 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1639,7 +1639,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
groupColIdx,
groupOperators,
groupCollations,
- NIL,
+ NULL,
NIL,
best_path->path.rows,
subplan);
@@ -2091,7 +2091,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
extract_grouping_ops(best_path->groupClause),
extract_grouping_collations(best_path->groupClause,
subplan->targetlist),
- NIL,
+ NULL,
NIL,
best_path->numGroups,
subplan);
@@ -2202,7 +2202,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
* never be grouping in an UPDATE/DELETE; but let's Assert that.
*/
Assert(root->inhTargetKind == INHKIND_NONE);
- Assert(root->grouping_map == NULL);
+// Assert(root->grouping_map == NULL);
root->grouping_map = grouping_map;
/*
@@ -2247,12 +2247,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
NIL,
rollup->numGroups,
sort_plan);
@@ -2285,12 +2285,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
chain,
rollup->numGroups,
subplan);
@@ -6189,7 +6189,7 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
+ RollupData *rollup, List *chain,
double dNumGroups, Plan *lefttree)
{
Agg *node = makeNode(Agg);
@@ -6207,7 +6207,7 @@ make_agg(List *tlist, List *qual,
node->grpCollations = grpCollations;
node->numGroups = numGroups;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
- node->groupingSets = groupingSets;
+ node->rollup = rollup;
node->chain = chain;
plan->qual = qual;
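The planner changes below build "final_rollups" by splitting every initial rollup into one rollup per grouping set, since a combining-stage tuple belongs to exactly one set. The shape of that transformation can be sketched as follows; Rollup and make_final_rollups() are simplified, hypothetical stand-ins for the planner's RollupData handling, not code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the planner's RollupData */
typedef struct
{
	int			nsets;		/* number of grouping sets in this rollup */
	int		   *set_ids;	/* grouping set ids, one per set */
} Rollup;

/*
 * Expand one initial rollup into an array of single-set rollups, one
 * per grouping set, as the combining stage requires.  Returns the
 * number of rollups produced.
 */
size_t
make_final_rollups(const Rollup *initial, Rollup *out)
{
	int			i;

	for (i = 0; i < initial->nsets; i++)
	{
		out[i].nsets = 1;
		out[i].set_ids = &initial->set_ids[i];
	}
	return (size_t) initial->nsets;
}
```

A three-set initial rollup thus yields three final rollups, each covering a single grouping set id.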
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 17c5f08..f147cac 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -107,6 +107,7 @@ typedef struct
typedef struct
{
List *rollups;
+ List *final_rollups;
List *hash_sets_idx;
double dNumHashGroups;
bool any_hashable;
@@ -114,6 +115,7 @@ typedef struct
Bitmapset *unhashable_refs;
List *unsortable_sets;
int *tleref_to_colnum_map;
+ int numGroupingSets;
} grouping_sets_data;
/*
@@ -127,6 +129,8 @@ typedef struct
* clauses per Window */
} WindowClauseSortData;
+typedef void (*add_path_callback) (RelOptInfo *parent_rel, Path *new_path);
+
/* Local functions */
static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
@@ -143,7 +147,8 @@ static double preprocess_limit(PlannerInfo *root,
static void remove_useless_groupby_columns(PlannerInfo *root);
static List *preprocess_groupclause(PlannerInfo *root, List *force);
static List *extract_rollup_sets(List *groupingSets);
-static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
+static List *reorder_grouping_sets(grouping_sets_data *gd,
+ List *groupingSets, List *sortclause);
static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
@@ -176,7 +181,10 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ List *havingQual,
+ AggSplit aggsplit);
+
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -2437,6 +2445,8 @@ preprocess_grouping_sets(PlannerInfo *root)
int maxref = 0;
ListCell *lc;
ListCell *lc_set;
+ ListCell *lc_rollup;
+ RollupData *rollup;
grouping_sets_data *gd = palloc0(sizeof(grouping_sets_data));
parse->groupingSets = expand_grouping_sets(parse->groupingSets, -1);
@@ -2488,6 +2498,7 @@ preprocess_grouping_sets(PlannerInfo *root)
GroupingSetData *gs = makeNode(GroupingSetData);
gs->set = gset;
+ gs->grpsetId = gd->numGroupingSets++;
gd->unsortable_sets = lappend(gd->unsortable_sets, gs);
/*
@@ -2519,8 +2530,8 @@ preprocess_grouping_sets(PlannerInfo *root)
foreach(lc_set, sets)
{
List *current_sets = (List *) lfirst(lc_set);
- RollupData *rollup = makeNode(RollupData);
GroupingSetData *gs;
+ rollup = makeNode(RollupData);
/*
* Reorder the current list of grouping sets into correct prefix
@@ -2532,7 +2543,7 @@ preprocess_grouping_sets(PlannerInfo *root)
* largest-member-first, and applies the GroupingSetData annotations,
* though the data will be filled in later.
*/
- current_sets = reorder_grouping_sets(current_sets,
+ current_sets = reorder_grouping_sets(gd, current_sets,
(list_length(sets) == 1
? parse->sortClause
: NIL));
@@ -2584,6 +2595,33 @@ preprocess_grouping_sets(PlannerInfo *root)
gd->rollups = lappend(gd->rollups, rollup);
}
+ /* Divide each initial rollup into single-set final_rollups. */
+ foreach(lc_rollup, gd->rollups)
+ {
+ RollupData *initial_rollup = lfirst(lc_rollup);
+
+ foreach(lc, initial_rollup->gsets_data)
+ {
+ GroupingSetData *gs = lfirst(lc);
+ rollup = makeNode(RollupData);
+
+ if (gs->set == NIL)
+ rollup->groupClause = NIL;
+ else
+ rollup->groupClause = preprocess_groupclause(root, gs->set);
+ rollup->gsets_data = list_make1(gs);
+ rollup->gsets = remap_to_groupclause_idx(rollup->groupClause,
+ rollup->gsets_data,
+ gd->tleref_to_colnum_map);
+
+ rollup->numGroups = gs->numGroups;
+ rollup->hashable = initial_rollup->hashable;
+ rollup->is_hashed = initial_rollup->is_hashed;
+
+ gd->final_rollups = lappend(gd->final_rollups, rollup);
+ }
+ }
+
if (gd->unsortable_sets)
{
/*
@@ -3541,7 +3579,7 @@ extract_rollup_sets(List *groupingSets)
* gets implemented in one pass.)
*/
static List *
-reorder_grouping_sets(List *groupingsets, List *sortclause)
+reorder_grouping_sets(grouping_sets_data *gd, List *groupingsets, List *sortclause)
{
ListCell *lc;
List *previous = NIL;
@@ -3575,6 +3613,7 @@ reorder_grouping_sets(List *groupingsets, List *sortclause)
previous = list_concat(previous, new_elems);
gs->set = list_copy(previous);
+ gs->grpsetId = gd->numGroupingSets++;
result = lcons(gs, result);
}
@@ -3725,6 +3764,30 @@ get_number_of_groups(PlannerInfo *root,
dNumGroups += rollup->numGroups;
}
+ foreach(lc, gd->final_rollups)
+ {
+ RollupData *rollup = lfirst_node(RollupData, lc);
+ ListCell *lc;
+
+ groupExprs = get_sortgrouplist_exprs(rollup->groupClause,
+ target_list);
+
+ rollup->numGroups = 0.0;
+
+ forboth(lc, rollup->gsets, lc2, rollup->gsets_data)
+ {
+ List *gset = (List *) lfirst(lc);
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc2);
+ double numGroups = estimate_num_groups(root,
+ groupExprs,
+ path_rows,
+ &gset);
+
+ gs->numGroups = numGroups;
+ rollup->numGroups += numGroups;
+ }
+ }
+
if (gd->hash_sets_idx)
{
ListCell *lc;
@@ -4190,9 +4253,26 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ List *havingQual,
+ AggSplit aggsplit)
{
- Query *parse = root->parse;
+ /* For partial path, add it to partial_pathlist */
+ add_path_callback add_path_cb =
+ (aggsplit == AGGSPLIT_INITIAL_SERIAL) ? add_partial_path : add_path;
+
+ /*
+ * If we are combining the partial grouping sets aggregation, the input
+ * is a mix of tuples from different grouping sets, and the executor
+ * dispatches each tuple to a rollup (phase) according to its grouping
+ * set id.
+ *
+ * We cannot reuse the rollups of the initial stage, where each tuple is
+ * processed by one or more grouping sets within a rollup, because in
+ * the combining stage each tuple belongs to exactly one grouping set.
+ * Instead we use final_rollups, in which each rollup contains a single
+ * grouping set.
+ */
+ List *rollups = DO_AGGSPLIT_COMBINE(aggsplit) ? gd->final_rollups : gd->rollups;
/*
* If we're not being offered sorted input, then only consider plans that
@@ -4213,7 +4293,7 @@ consider_groupingsets_paths(PlannerInfo *root,
List *empty_sets_data = NIL;
List *empty_sets = NIL;
ListCell *lc;
- ListCell *l_start = list_head(gd->rollups);
+ ListCell *l_start = list_head(rollups);
AggStrategy strat = AGG_HASHED;
double hashsize;
double exclude_groups = 0.0;
@@ -4245,7 +4325,7 @@ consider_groupingsets_paths(PlannerInfo *root,
{
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
- l_start = lnext(gd->rollups, l_start);
+ l_start = lnext(rollups, l_start);
}
hashsize = estimate_hashagg_tablesize(path,
@@ -4253,11 +4333,11 @@ consider_groupingsets_paths(PlannerInfo *root,
dNumGroups - exclude_groups);
/*
- * gd->rollups is empty if we have only unsortable columns to work
+ * rollups is empty if we have only unsortable columns to work
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
+ if (hashsize > work_mem * 1024L && rollups)
return; /* nope, won't fit */
/*
@@ -4266,7 +4346,7 @@ consider_groupingsets_paths(PlannerInfo *root,
*/
sets_data = list_copy(gd->unsortable_sets);
- for_each_cell(lc, gd->rollups, l_start)
+ for_each_cell(lc, rollups, l_start)
{
RollupData *rollup = lfirst_node(RollupData, lc);
@@ -4334,34 +4414,60 @@ consider_groupingsets_paths(PlannerInfo *root,
}
else if (empty_sets)
{
- RollupData *rollup = makeNode(RollupData);
+ /*
+ * If we are combining, each empty set becomes its own rollup;
+ * otherwise, all empty sets are placed in a single rollup.
+ */
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ {
+ ListCell *lc2;
+ forboth(lc, empty_sets, lc2, empty_sets_data)
+ {
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc2);
+ RollupData *rollup = makeNode(RollupData);
+
+ rollup->groupClause = NIL;
+ rollup->gsets_data = list_make1(gs);
+ rollup->gsets = list_make1(NIL);
+ rollup->numGroups = 1;
+ rollup->hashable = false;
+ rollup->is_hashed = false;
+ new_rollups = lappend(new_rollups, rollup);
+ }
+ }
+ else
+ {
+ RollupData *rollup = makeNode(RollupData);
+
+ rollup->groupClause = NIL;
+ rollup->gsets_data = empty_sets_data;
+ rollup->gsets = empty_sets;
+ rollup->numGroups = list_length(empty_sets);
+ rollup->hashable = false;
+ rollup->is_hashed = false;
+ new_rollups = lappend(new_rollups, rollup);
+ }
- rollup->groupClause = NIL;
- rollup->gsets_data = empty_sets_data;
- rollup->gsets = empty_sets;
- rollup->numGroups = list_length(empty_sets);
- rollup->hashable = false;
- rollup->is_hashed = false;
- new_rollups = lappend(new_rollups, rollup);
strat = AGG_MIXED;
}
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- strat,
- new_rollups,
- agg_costs,
- dNumGroups));
+ add_path_cb(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ strat,
+ new_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit));
return;
}
/*
* If we have sorted input but nothing we can do with it, bail.
*/
- if (list_length(gd->rollups) == 0)
+ if (list_length(rollups) == 0)
return;
/*
@@ -4374,7 +4480,7 @@ consider_groupingsets_paths(PlannerInfo *root,
*/
if (can_hash && gd->any_hashable)
{
- List *rollups = NIL;
+ List *mixed_rollups = NIL;
List *hash_sets = list_copy(gd->unsortable_sets);
double availspace = (work_mem * 1024.0);
ListCell *lc;
@@ -4386,10 +4492,10 @@ consider_groupingsets_paths(PlannerInfo *root,
agg_costs,
gd->dNumHashGroups);
- if (availspace > 0 && list_length(gd->rollups) > 1)
+ if (availspace > 0 && list_length(rollups) > 1)
{
double scale;
- int num_rollups = list_length(gd->rollups);
+ int num_rollups = list_length(rollups);
int k_capacity;
int *k_weights = palloc(num_rollups * sizeof(int));
Bitmapset *hash_items = NULL;
@@ -4427,11 +4533,13 @@ consider_groupingsets_paths(PlannerInfo *root,
* below, must use the same condition.
*/
i = 0;
- for_each_cell(lc, gd->rollups, list_second_cell(gd->rollups))
+ for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst_node(RollupData, lc);
- if (rollup->hashable)
+ /* The empty grouping set cannot be hashed either */
+ if (rollup->hashable &&
+ list_length(linitial(rollup->gsets)) != 0)
{
double sz = estimate_hashagg_tablesize(path,
agg_costs,
@@ -4458,30 +4566,31 @@ consider_groupingsets_paths(PlannerInfo *root,
if (!bms_is_empty(hash_items))
{
- rollups = list_make1(linitial(gd->rollups));
+ mixed_rollups = list_make1(linitial(rollups));
i = 0;
- for_each_cell(lc, gd->rollups, list_second_cell(gd->rollups))
+ for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst_node(RollupData, lc);
- if (rollup->hashable)
+ if (rollup->hashable &&
+ list_length(linitial(rollup->gsets)) != 0)
{
if (bms_is_member(i, hash_items))
hash_sets = list_concat(hash_sets,
rollup->gsets_data);
else
- rollups = lappend(rollups, rollup);
+ mixed_rollups = lappend(mixed_rollups, rollup);
++i;
}
else
- rollups = lappend(rollups, rollup);
+ mixed_rollups = lappend(mixed_rollups, rollup);
}
}
}
- if (!rollups && hash_sets)
- rollups = list_copy(gd->rollups);
+ if (!mixed_rollups && hash_sets)
+ mixed_rollups = list_copy(rollups);
foreach(lc, hash_sets)
{
@@ -4498,20 +4607,21 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = gs->numGroups;
rollup->hashable = true;
rollup->is_hashed = true;
- rollups = lcons(rollup, rollups);
+ mixed_rollups = lcons(rollup, mixed_rollups);
}
- if (rollups)
+ if (mixed_rollups)
{
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_MIXED,
- rollups,
- agg_costs,
- dNumGroups));
+ add_path_cb(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_MIXED,
+ mixed_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit));
}
}
@@ -4519,15 +4629,16 @@ consider_groupingsets_paths(PlannerInfo *root,
* Now try the simple sorted case.
*/
if (!gd->unsortable_sets)
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_SORTED,
- gd->rollups,
- agg_costs,
- dNumGroups));
+ add_path_cb(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_SORTED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit));
}
/*
@@ -5242,6 +5353,13 @@ make_partial_grouping_target(PlannerInfo *root,
add_new_columns_to_pathtarget(partial_target, non_group_exprs);
+ /*
+ * Add a grouping set id column so that the combining stage can
+ * identify which grouping set each partial tuple belongs to.
+ */
+ if (parse->groupingSets)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ add_new_column_to_pathtarget(partial_target, (Expr *)expr);
+ }
+
/*
* Adjust Aggrefs to put them in partial mode. At this point all Aggrefs
* are at the top level of the target list, so we can just scan the list
@@ -6412,7 +6530,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
consider_groupingsets_paths(root, grouped_rel,
path, true, can_hash,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_SIMPLE);
}
else if (parse->hasAggs)
{
@@ -6479,7 +6599,15 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
-1.0);
}
- if (parse->hasAggs)
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, grouped_rel,
+ path, true, can_hash,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_FINAL_DESERIAL);
+ }
+ else if (parse->hasAggs)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6514,7 +6642,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_SIMPLE);
}
else
{
@@ -6557,22 +6687,37 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = partially_grouped_rel->cheapest_total_path;
- hashaggtablesize = estimate_hashagg_tablesize(path,
- agg_final_costs,
- dNumGroups);
+ if (parse->groupingSets)
+ {
+ /*
+ * Try for a hash-only groupingsets path over unsorted input.
+ */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_FINAL_DESERIAL);
+ }
+ else
+ {
- if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ hashaggtablesize = estimate_hashagg_tablesize(path,
+ agg_final_costs,
+ dNumGroups);
+
+ if (hashaggtablesize < work_mem * 1024L)
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6789,8 +6934,16 @@ create_partial_grouping_paths(PlannerInfo *root,
path,
root->group_pathkeys,
-1.0);
-
- if (parse->hasAggs)
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ path, true, can_hash,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGGSPLIT_INITIAL_SERIAL);
+ }
+ else if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
create_agg_path(root,
partially_grouped_rel,
@@ -6851,26 +7004,39 @@ create_partial_grouping_paths(PlannerInfo *root,
{
double hashaggtablesize;
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_partial_path,
- agg_partial_costs,
- dNumPartialPartialGroups);
-
- /* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
- cheapest_partial_path != NULL)
+ if (parse->groupingSets)
{
- add_partial_path(partially_grouped_rel, (Path *)
- create_agg_path(root,
- partially_grouped_rel,
- cheapest_partial_path,
- partially_grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_INITIAL_SERIAL,
- parse->groupClause,
- NIL,
- agg_partial_costs,
- dNumPartialPartialGroups));
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ cheapest_partial_path,
+ false, true,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGGSPLIT_INITIAL_SERIAL);
+ }
+ else
+ {
+ hashaggtablesize =
+ estimate_hashagg_tablesize(cheapest_partial_path,
+ agg_partial_costs,
+ dNumPartialPartialGroups);
+
+ /* Do the same for partial paths. */
+ if (hashaggtablesize < work_mem * 1024L &&
+ cheapest_partial_path != NULL)
+ {
+ add_partial_path(partially_grouped_rel, (Path *)
+ create_agg_path(root,
+ partially_grouped_rel,
+ cheapest_partial_path,
+ partially_grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ parse->groupClause,
+ NIL,
+ agg_partial_costs,
+ dNumPartialPartialGroups));
+ }
}
}
@@ -6913,6 +7079,9 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
/* Try Gather for unordered paths and Gather Merge for ordered ones. */
generate_gather_paths(root, rel, true);
+ if (root->parse->groupingSets)
+ return;
+
/* Try cheapest partial path + explicit Sort + Gather Merge. */
cheapest_partial_path = linitial(rel->partial_pathlist);
if (!pathkeys_contained_in(root->group_pathkeys,
@@ -6958,11 +7127,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 566ee96..d8723b9 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -728,6 +728,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
plan->qual = (List *)
convert_combining_aggrefs((Node *) plan->qual,
NULL);
+
+ /*
+ * If this node is combining partial grouping-sets aggregation,
+ * we must add a reference to the GroupingSetId expression in
+ * the targetlist of the child plan node.
+ */
+ if (agg->rollup)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ indexed_tlist *subplan_itlist = build_tlist_index(plan->lefttree->targetlist);
+
+ agg->grpSetIdFilter = fix_upper_expr(root, (Node *)expr,
+ subplan_itlist,
+ OUTER_VAR,
+ rtoffset);
+ }
}
set_upper_references(root, plan, rtoffset);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 34acb73..578ad60 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2992,7 +2992,8 @@ create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups)
+ double numGroups,
+ AggSplit aggsplit)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
PathTarget *target = rel->reltarget;
@@ -3010,6 +3011,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->aggsplit = aggsplit;
/*
* Simplify callers by downgrading AGG_SORTED to AGG_PLAIN, and AGG_MIXED
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 3e64390..f3e5766 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -7874,6 +7874,12 @@ get_rule_expr(Node *node, deparse_context *context,
}
break;
+ case T_GroupingSetId:
+ {
+ appendStringInfoString(buf, "GROUPINGSETID()");
+ }
+ break;
+
case T_WindowFunc:
get_windowfunc_expr((WindowFunc *) node, context);
break;
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index d21dbead..1361955 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -216,6 +216,7 @@ typedef enum ExprEvalOp
EEOP_XMLEXPR,
EEOP_AGGREF,
EEOP_GROUPING_FUNC,
+ EEOP_GROUPING_SET_ID,
EEOP_WINDOW_FUNC,
EEOP_SUBPLAN,
EEOP_ALTERNATIVE_SUBPLAN,
@@ -226,6 +227,7 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_PERHASH_NULL_CHECK,
EEOP_AGG_STRICT_TRANS_CHECK,
EEOP_AGG_PLAIN_TRANS_BYVAL,
EEOP_AGG_PLAIN_TRANS,
@@ -573,6 +575,12 @@ typedef struct ExprEvalStep
List *clauses; /* integer list of column numbers */
} grouping_func;
+ /* for EEOP_GROUPING_SET_ID */
+ struct
+ {
+ AggState *parent; /* parent Agg */
+ } grouping_set_id;
+
/* for EEOP_WINDOW_FUNC */
struct
{
@@ -634,6 +642,17 @@ typedef struct ExprEvalStep
int jumpnull;
} agg_init_trans;
+ /* for EEOP_AGG_PERHASH_NULL_CHECK */
+ struct
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ int setno;
+ int transno;
+ int setoff;
+ int jumpnull;
+ } agg_perhash_null_check;
+
/* for EEOP_AGG_STRICT_TRANS_CHECK */
struct
{
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 1a8ca98..4e5ec06 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -280,6 +280,11 @@ typedef struct AggStatePerPhaseData
Sort *sortnode; /* Sort node for input ordering for phase */
ExprState *evaltrans; /* evaluation of transition functions */
+
+ /* field for parallel grouping sets */
+ int *grpsetids;
+ Tuplesortstate *sort_in; /* sorted input to phases > 1 */
+ Tuplestorestate *store_in; /* sorted input to phases > 1 */
} AggStatePerPhaseData;
/*
@@ -302,8 +307,10 @@ typedef struct AggStatePerHashData
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
Agg *aggnode; /* original Agg node, for numGroups etc. */
-} AggStatePerHashData;
+ /* field for parallel grouping sets */
+ int grpsetid;
+} AggStatePerHashData;
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 063b490..0ba408e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1972,6 +1972,13 @@ typedef struct GroupState
* expressions and run the aggregate transition functions.
* ---------------------
*/
+/* mapping from grouping set id to perphase or perhash data */
+typedef struct GrpSetMapping
+{
+ bool is_hashed;
+ int index; /* index of aggstate->perhash[] or aggstate->phases[]*/
+} GrpSetMapping;
+
/* these structs are private in nodeAgg.c: */
typedef struct AggStatePerAggData *AggStatePerAgg;
typedef struct AggStatePerTransData *AggStatePerTrans;
@@ -2013,6 +2020,7 @@ typedef struct AggState
Tuplesortstate *sort_in; /* sorted input to phases > 1 */
Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
+ Tuplestorestate *store_in; /* sorted input to phases > 1 */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
AggStatePerGroup *pergroups; /* grouping set indexed array of per-group
* pointers */
@@ -2029,6 +2037,12 @@ typedef struct AggState
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
+
+ /* support for parallel grouping sets */
+ bool input_dispatched;
+ ExprState *grpsetid_filter; /* filter to fetch grouping set id
+ from child targetlist */
+ struct GrpSetMapping *grpSetMappings; /* grpsetid <-> perhash or perphase data */
} AggState;
/* ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 3cbb08d..9594201 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -153,6 +153,7 @@ typedef enum NodeTag
T_Param,
T_Aggref,
T_GroupingFunc,
+ T_GroupingSetId,
T_WindowFunc,
T_SubscriptingRef,
T_FuncExpr,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 13b147d..6d6fc55 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1674,6 +1674,7 @@ typedef struct GroupingSetData
{
NodeTag type;
List *set; /* grouping set as list of sortgrouprefs */
+ int grpsetId; /* unique grouping set identifier */
double numGroups; /* est. number of result groups */
} GroupingSetData;
@@ -1699,6 +1700,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 8e6594e..74e8fb5 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -20,6 +20,7 @@
#include "nodes/bitmapset.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
+#include "nodes/pathnodes.h"
/* ----------------------------------------------------------------
@@ -811,8 +812,9 @@ typedef struct Agg
long numGroups; /* estimated number of groups in input */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
- List *groupingSets; /* grouping sets to use */
+ RollupData *rollup; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Node *grpSetIdFilter; /* expr returning the grouping set id of input */
} Agg;
/* ----------------
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index 860a84d..e96cbfc 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -350,6 +350,12 @@ typedef struct GroupingFunc
int location; /* token location */
} GroupingFunc;
+/* GroupingSetId - returns the id of the grouping set a tuple belongs to */
+typedef struct GroupingSetId
+{
+ Expr xpr;
+} GroupingSetId;
+
/*
* WindowFunc
*/
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index a12af54..900070b 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,7 +217,8 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups);
+ double numGroups,
+ AggSplit aggsplit);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
PathTarget *target,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index e7aaddd..b28476f 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,7 +54,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
+ RollupData *rollup, List *chain,
double dNumGroups, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
--
1.8.3.1
On Thu, Nov 28, 2019 at 07:07:22PM +0800, Pengzhou Tang wrote:
Richard pointed out that he got incorrect results with the patch I
attached; there are bugs somewhere.
I have fixed them now and attached the newest version; please refer
to [1] for the fix.
Mr Robot is reporting that the latest patch fails to build, at least
on Windows. Could you please send a rebase? For now, I have moved
the patch to the next CF, waiting on author.
--
Michael
On Sun, Dec 1, 2019 at 10:03 AM Michael Paquier <michael@paquier.xyz> wrote:
On Thu, Nov 28, 2019 at 07:07:22PM +0800, Pengzhou Tang wrote:
Richard pointed out that he got incorrect results with the patch I
attached; there are bugs somewhere.
I have fixed them now and attached the newest version; please refer
to [1] for the fix.

Mr Robot is reporting that the latest patch fails to build, at least
on Windows. Could you please send a rebase? For now, I have moved
the patch to the next CF, waiting on author.
Thanks for reporting this issue. Here is the rebase.
Thanks
Richard
Attachments:
v3-0001-Support-for-parallel-grouping-sets.patch
From 96aa8b276e02bf8969438e5f1bb4b7944df395aa Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Tue, 24 Sep 2019 04:22:42 -0400
Subject: [PATCH] Support for parallel grouping sets
Grouping sets used to be supported in a single worker only; this
patch adds support for parallel grouping sets across multiple
workers. In the first stage, the partial aggregates are performed
by multiple workers; each worker performs the aggregates on all
grouping sets, and meanwhile a grouping set id is attached to the
tuples of the first stage to identify which grouping set each
tuple belongs to. In the final stage, the gathered tuples are
dispatched to the specified grouping set according to the attached
set id, and combine aggregates are then performed per grouping
set. We don't use the GROUPING() function to identify the grouping
set because the sets may contain duplicate grouping sets.
Some changes are also made to the executor in the final stage:

For the AGG_HASHED strategy, all grouping sets still perform
combine aggregates in phase 0; the only difference is that
only one group is selected per tuple in the final stage, so we
need to skip the unselected groups.

For the AGG_MIXED strategy, phase 0 now also needs to do its
own aggregation.

For the AGG_SORTED strategy, the rollup will be expanded, e.g.
rollup(<c1, c2>, <c1>, <>) is expanded to three rollups:
rollup(<c1, c2>), rollup(<c1>) and rollup(<>), so tuples
can be dispatched to those three phases and then aggregated.
---
src/backend/commands/explain.c | 10 +-
src/backend/executor/execExpr.c | 42 +++-
src/backend/executor/execExprInterp.c | 34 +++
src/backend/executor/nodeAgg.c | 330 +++++++++++++++++++++++++---
src/backend/nodes/copyfuncs.c | 55 ++++-
src/backend/nodes/equalfuncs.c | 3 +
src/backend/nodes/nodeFuncs.c | 8 +
src/backend/nodes/outfuncs.c | 13 +-
src/backend/nodes/readfuncs.c | 52 ++++-
src/backend/optimizer/path/allpaths.c | 3 +
src/backend/optimizer/plan/createplan.c | 18 +-
src/backend/optimizer/plan/planner.c | 376 +++++++++++++++++++++++---------
src/backend/optimizer/plan/setrefs.c | 16 ++
src/backend/optimizer/util/pathnode.c | 4 +-
src/backend/utils/adt/ruleutils.c | 6 +
src/include/executor/execExpr.h | 19 ++
src/include/executor/nodeAgg.h | 9 +-
src/include/nodes/execnodes.h | 14 ++
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 2 +
src/include/nodes/plannodes.h | 4 +-
src/include/nodes/primnodes.h | 6 +
src/include/optimizer/pathnode.h | 3 +-
src/include/optimizer/planmain.h | 2 +-
24 files changed, 869 insertions(+), 161 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d189b8d..828e863 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2224,12 +2224,16 @@ show_agg_keys(AggState *astate, List *ancestors,
{
Agg *plan = (Agg *) astate->ss.ps.plan;
- if (plan->numCols > 0 || plan->groupingSets)
+ if (plan->grpSetIdFilter)
+ show_expression(plan->grpSetIdFilter, "Dispatched by",
+ (PlanState *)astate, ancestors, true, es);
+
+ if (plan->numCols > 0 || plan->rollup)
{
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(plan, ancestors);
- if (plan->groupingSets)
+ if (plan->rollup)
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
@@ -2281,7 +2285,7 @@ show_grouping_set_keys(PlanState *planstate,
Plan *plan = planstate->plan;
char *exprstr;
ListCell *lc;
- List *gsets = aggnode->groupingSets;
+ List *gsets = aggnode->rollup->gsets;
AttrNumber *keycols = aggnode->grpColIdx;
const char *keyname;
const char *keysetname;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 8619246..321a016 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -814,7 +814,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
agg = (Agg *) (state->parent->plan);
- if (agg->groupingSets)
+ if (agg->rollup)
scratch.d.grouping_func.clauses = grp_node->cols;
else
scratch.d.grouping_func.clauses = NIL;
@@ -823,6 +823,15 @@ ExecInitExprRec(Expr *node, ExprState *state,
break;
}
+ case T_GroupingSetId:
+ {
+ scratch.opcode = EEOP_GROUPING_SET_ID;
+ scratch.d.grouping_set_id.parent = (AggState *) state->parent;
+
+ ExprEvalPushStep(state, &scratch);
+ break;
+ }
+
case T_WindowFunc:
{
WindowFunc *wfunc = (WindowFunc *) node;
@@ -3230,6 +3239,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
{
int adjust_init_jumpnull = -1;
int adjust_strict_jumpnull = -1;
+ int adjust_perhash_jumpnull = -1;
ExprContext *aggcontext;
if (ishash)
@@ -3262,6 +3272,30 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
adjust_init_jumpnull = state->steps_len - 1;
}
+ /*
+ * All grouping sets that use AGG_HASHED are sent to
+ * phase zero. When combining the partial aggregate
+ * results, only one group is selected for each tuple,
+ * so we need to add one more check step to skip the
+ * unselected groups.
+ */
+ if (ishash && aggstate->grpsetid_filter &&
+ DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ scratch->opcode = EEOP_AGG_PERHASH_NULL_CHECK;
+ scratch->d.agg_perhash_null_check.aggstate = aggstate;
+ scratch->d.agg_perhash_null_check.setno = setno;
+ scratch->d.agg_perhash_null_check.setoff = setoff;
+ scratch->d.agg_perhash_null_check.transno = transno;
+ scratch->d.agg_perhash_null_check.jumpnull = -1; /* adjust later */
+ ExprEvalPushStep(state, scratch);
+
+ /*
+ * Note: we don't push this into adjust_bailout here - those steps
+ * jump to the end of the whole expression instead.
+ */
+ adjust_perhash_jumpnull = state->steps_len - 1;
+ }
+
if (pertrans->numSortCols == 0 &&
fcinfo->flinfo->fn_strict)
{
@@ -3307,6 +3341,12 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
Assert(as->d.agg_init_trans.jumpnull == -1);
as->d.agg_init_trans.jumpnull = state->steps_len;
}
+ if (adjust_perhash_jumpnull != -1)
+ {
+ ExprEvalStep *as = &state->steps[adjust_perhash_jumpnull];
+ Assert(as->d.agg_perhash_null_check.jumpnull == -1);
+ as->d.agg_perhash_null_check.jumpnull = state->steps_len;
+ }
if (adjust_strict_jumpnull != -1)
{
ExprEvalStep *as = &state->steps[adjust_strict_jumpnull];
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 7903800..2e49086 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -422,6 +422,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_XMLEXPR,
&&CASE_EEOP_AGGREF,
&&CASE_EEOP_GROUPING_FUNC,
+ &&CASE_EEOP_GROUPING_SET_ID,
&&CASE_EEOP_WINDOW_FUNC,
&&CASE_EEOP_SUBPLAN,
&&CASE_EEOP_ALTERNATIVE_SUBPLAN,
@@ -430,6 +431,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
&&CASE_EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
&&CASE_EEOP_AGG_INIT_TRANS,
+ &&CASE_EEOP_AGG_PERHASH_NULL_CHECK,
&&CASE_EEOP_AGG_STRICT_TRANS_CHECK,
&&CASE_EEOP_AGG_PLAIN_TRANS_BYVAL,
&&CASE_EEOP_AGG_PLAIN_TRANS,
@@ -1503,6 +1505,21 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_GROUPING_SET_ID)
+ {
+ int grpsetid;
+ AggState *aggstate = (AggState *) op->d.grouping_set_id.parent;
+
+ if (aggstate->current_phase == 0)
+ grpsetid = aggstate->perhash[aggstate->current_set].grpsetid;
+ else
+ grpsetid = aggstate->phase->grpsetids[aggstate->current_set];
+
+ *op->resvalue = grpsetid;
+ *op->resnull = false;
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_WINDOW_FUNC)
{
/*
@@ -1626,6 +1643,23 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PERHASH_NULL_CHECK)
+ {
+ AggState *aggstate;
+ AggStatePerGroup pergroup;
+
+ aggstate = op->d.agg_perhash_null_check.aggstate;
+ pergroup = &aggstate->all_pergroups
+ [op->d.agg_perhash_null_check.setoff]
+ [op->d.agg_perhash_null_check.transno];
+
+ /* Skip this transition if the group was not selected for this tuple. */
+ if (!pergroup)
+ EEO_JUMP(op->d.agg_perhash_null_check.jumpnull);
+
+ EEO_NEXT();
+ }
+
/* check that a strict aggregate's input isn't NULL */
EEO_CASE(EEOP_AGG_STRICT_TRANS_CHECK)
{
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 98bee4c..606b30d 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -226,6 +226,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
@@ -275,6 +276,7 @@ static void build_hash_table(AggState *aggstate);
static TupleHashEntryData *lookup_hash_entry(AggState *aggstate);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
+static void agg_dispatch_input_tuples(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
@@ -313,9 +315,6 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
/*
* Switch to phase "newphase", which must either be 0 or 1 (to reset) or
* current_phase + 1. Juggle the tuplesorts accordingly.
- *
- * Phase 0 is for hashing, which we currently handle last in the AGG_MIXED
- * case, so when entering phase 0, all we need to do is drop open sorts.
*/
static void
initialize_phase(AggState *aggstate, int newphase)
@@ -332,6 +331,12 @@ initialize_phase(AggState *aggstate, int newphase)
aggstate->sort_in = NULL;
}
+ if (aggstate->store_in)
+ {
+ tuplestore_end(aggstate->store_in);
+ aggstate->store_in = NULL;
+ }
+
if (newphase <= 1)
{
/*
@@ -345,21 +350,36 @@ initialize_phase(AggState *aggstate, int newphase)
}
else
{
- /*
- * The old output tuplesort becomes the new input one, and this is the
- * right time to actually sort it.
+ /*
+ * When combining partial grouping sets aggregate results, we use
+ * the sort_in or store_in which contains the dispatched tuples as
+ * the input. Otherwise, use the sort_out of the previous phase.
*/
- aggstate->sort_in = aggstate->sort_out;
+ if (DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ aggstate->sort_in = aggstate->phases[newphase].sort_in;
+ aggstate->store_in = aggstate->phases[newphase].store_in;
+ }
+ else
+ {
+ aggstate->sort_in = aggstate->sort_out;
+ aggstate->store_in = NULL;
+ }
+
aggstate->sort_out = NULL;
- Assert(aggstate->sort_in);
- tuplesort_performsort(aggstate->sort_in);
+ Assert(aggstate->sort_in || aggstate->store_in);
+
+ /* This is the right time to actually sort it. */
+ if (aggstate->sort_in)
+ tuplesort_performsort(aggstate->sort_in);
}
/*
* If this isn't the last phase, we need to sort appropriately for the
* next phase in sequence.
*/
- if (newphase > 0 && newphase < aggstate->numphases - 1)
+ if (aggstate->aggsplit != AGGSPLIT_FINAL_DESERIAL &&
+ newphase > 0 && newphase < aggstate->numphases - 1)
{
Sort *sortnode = aggstate->phases[newphase + 1].sortnode;
PlanState *outerNode = outerPlanState(aggstate);
@@ -401,6 +421,15 @@ fetch_input_tuple(AggState *aggstate)
return NULL;
slot = aggstate->sort_slot;
}
+ else if (aggstate->store_in)
+ {
+ /* make sure we check for interrupts in either path through here */
+ CHECK_FOR_INTERRUPTS();
+ if (!tuplestore_gettupleslot(aggstate->store_in, true, false,
+ aggstate->sort_slot))
+ return NULL;
+ slot = aggstate->sort_slot;
+ }
else
slot = ExecProcNode(outerPlanState(aggstate));
@@ -1527,6 +1556,33 @@ lookup_hash_entries(AggState *aggstate)
AggStatePerGroup *pergroup = aggstate->hash_pergroup;
int setno;
+ if (aggstate->grpsetid_filter)
+ {
+ bool dummynull;
+ int grpsetid = ExecEvalExprSwitchContext(aggstate->grpsetid_filter,
+ aggstate->tmpcontext,
+ &dummynull);
+ GrpSetMapping *mapping = &aggstate->grpSetMappings[grpsetid];
+
+ if (!mapping)
+ return;
+
+ for (setno = 0; setno < numHashes; setno++)
+ {
+ if (setno == mapping->index)
+ {
+ select_current_set(aggstate, setno, true);
+ pergroup[setno] = lookup_hash_entry(aggstate)->additional;
+ }
+ else
+ {
+ pergroup[setno] = NULL;
+ }
+ }
+
+ return;
+ }
+
for (setno = 0; setno < numHashes; setno++)
{
select_current_set(aggstate, setno, true);
@@ -1569,6 +1625,9 @@ ExecAgg(PlanState *pstate)
break;
case AGG_PLAIN:
case AGG_SORTED:
+ if (node->grpsetid_filter && !node->input_dispatched)
+ agg_dispatch_input_tuples(node);
+
result = agg_retrieve_direct(node);
break;
}
@@ -1680,10 +1739,20 @@ agg_retrieve_direct(AggState *aggstate)
else if (aggstate->aggstrategy == AGG_MIXED)
{
/*
- * Mixed mode; we've output all the grouped stuff and have
- * full hashtables, so switch to outputting those.
+ * Mixed mode; in the non-combine case, we've output all the
+ * grouped stuff and have full hashtables, so switch to
+ * outputting those. In the combine case, phase one does not
+ * do this, so we need to do our own hashed grouping here.
*/
initialize_phase(aggstate, 0);
+
+ if (DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ /* use the store_in which contains the dispatched tuples */
+ aggstate->store_in = aggstate->phase->store_in;
+ agg_fill_hash_table(aggstate);
+ }
+
aggstate->table_filled = true;
ResetTupleHashIterator(aggstate->perhash[0].hashtable,
&aggstate->perhash[0].hashiter);
@@ -1838,7 +1907,8 @@ agg_retrieve_direct(AggState *aggstate)
* hashtables as well in advance_aggregates.
*/
if (aggstate->aggstrategy == AGG_MIXED &&
- aggstate->current_phase == 1)
+ aggstate->current_phase == 1 &&
+ !aggstate->grpsetid_filter)
{
lookup_hash_entries(aggstate);
}
@@ -1921,6 +1991,122 @@ agg_retrieve_direct(AggState *aggstate)
}
/*
+ * ExecAgg for parallel grouping sets:
+ *
+ * When combining the partial grouping sets aggregate results from workers,
+ * the input is a mix of tuples from different grouping sets. To avoid
+ * unnecessary work, the tuples are pre-dispatched directly to the
+ * corresponding phases.
+ *
+ * This function must be called in phase one, which is AGG_SORTED or
+ * AGG_PLAIN.
+ */
+static void
+agg_dispatch_input_tuples(AggState *aggstate)
+{
+ int grpsetid;
+ int phase;
+ bool isNull;
+ PlanState *saved_sort;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ GrpSetMapping *mapping;
+ TupleTableSlot *outerslot;
+ AggStatePerPhase perphase;
+
+ /* prepare tuplestore or tuplesort for each phase */
+ for (phase = 0; phase < aggstate->numphases; phase++)
+ {
+ perphase = &aggstate->phases[phase];
+
+ if (!perphase->aggnode)
+ continue;
+
+ if (perphase->aggstrategy == AGG_SORTED)
+ {
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+ Sort *sortnode = (Sort *) perphase->aggnode->plan.lefttree;
+
+ Assert(perphase->aggstrategy == AGG_SORTED);
+
+ perphase->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ }
+ else
+ perphase->store_in = tuplestore_begin_heap(false, false, work_mem);
+ }
+
+ /*
+ * If phase one is AGG_SORTED, we cannot run the sort node beneath it
+ * directly because its input mixes tuples from different grouping sets;
+ * we need to dispatch the tuples first and then do the sort.
+ *
+ * To do this, we replace the outerPlan of the current Agg node with the
+ * child node of the sort node.
+ *
+ * This is unnecessary for AGG_PLAIN.
+ */
+ if (aggstate->phase->aggstrategy == AGG_SORTED)
+ {
+ saved_sort = outerPlanState(aggstate);
+ outerPlanState(aggstate) = outerPlanState(outerPlanState(aggstate));
+ }
+
+ for (;;)
+ {
+ outerslot = fetch_input_tuple(aggstate);
+ if (TupIsNull(outerslot))
+ break;
+
+ /* set up for advance_aggregates */
+ tmpcontext->ecxt_outertuple = outerslot;
+ grpsetid = ExecEvalExprSwitchContext(aggstate->grpsetid_filter,
+ tmpcontext,
+ &isNull);
+
+ /* route the tuple to the corresponding phase using its grouping set id */
+ mapping = &aggstate->grpSetMappings[grpsetid];
+ if (!mapping->is_hashed)
+ {
+ perphase = &aggstate->phases[mapping->index];
+
+ if (perphase->aggstrategy == AGG_SORTED)
+ tuplesort_puttupleslot(perphase->sort_in, outerslot);
+ else
+ tuplestore_puttupleslot(perphase->store_in, outerslot);
+ }
+ else
+ tuplestore_puttupleslot(aggstate->phases[0].store_in, outerslot);
+
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ /* Restore the outer plan and perform the sorting here. */
+ if (aggstate->phase->aggstrategy == AGG_SORTED)
+ {
+ outerPlanState(aggstate) = saved_sort;
+ tuplesort_performsort(aggstate->phase->sort_in);
+ }
+
+ /*
+ * Reinitialize phase one to use the store_in or
+ * sort_in that contains the dispatched tuples.
+ */
+ aggstate->sort_in = aggstate->phase->sort_in;
+ aggstate->store_in = aggstate->phase->store_in;
+ select_current_set(aggstate, 0, false);
+
+ /* mark the input dispatched */
+ aggstate->input_dispatched = true;
+}
+
+/*
* ExecAgg for hashed case: read input and build hash table
*/
static void
@@ -2146,6 +2332,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->grp_firstTuple = NULL;
aggstate->sort_in = NULL;
aggstate->sort_out = NULL;
+ aggstate->input_dispatched = false;
/*
* phases[0] always exists, but is dummy in sorted/plain mode
@@ -2158,16 +2345,16 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* determines the size of some allocations. Also calculate the number of
* phases, since all hashed/mixed nodes contribute to only a single phase.
*/
- if (node->groupingSets)
+ if (node->rollup)
{
- numGroupingSets = list_length(node->groupingSets);
+ numGroupingSets = list_length(node->rollup->gsets);
foreach(l, node->chain)
{
Agg *agg = lfirst(l);
numGroupingSets = Max(numGroupingSets,
- list_length(agg->groupingSets));
+ list_length(agg->rollup->gsets));
/*
* additional AGG_HASHED aggs become part of phase 0, but all
@@ -2186,6 +2373,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
+ /*
+ * When combining the partial grouping sets aggregate results, we
+ * need a grpsetid mapping to find the corresponding perhash or
+ * perphase data.
+ */
+ if (DO_AGGSPLIT_COMBINE(node->aggsplit) && node->rollup)
+ aggstate->grpSetMappings = (GrpSetMapping *)
+ palloc0(sizeof(GrpSetMapping) * (numPhases + numHashes));
+
/*
* Create expression contexts. We need three or more, one for
* per-input-tuple processing, one for per-output-tuple processing, one
@@ -2243,8 +2439,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
/*
* If there are more than two phases (including a potential dummy phase
* 0), input will be resorted using tuplesort. Need a slot for that.
+ *
+ * Also, when combining the partial grouping sets aggregate results,
+ * input belonging to an AGG_HASHED rollup will use a tuplestore.
+ * Need a slot for that.
*/
- if (numPhases > 2)
+ if (numPhases > 2 ||
+ (DO_AGGSPLIT_COMBINE(node->aggsplit) &&
+ node->aggstrategy == AGG_MIXED))
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -2291,6 +2492,14 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
ExecInitQual(node->plan.qual, (PlanState *) aggstate);
/*
+ * Initialize grouping set id expression to identify which
+ * grouping set the input tuple belongs to when combining
+ * partial groupingsets aggregate result.
+ */
+ aggstate->grpsetid_filter = ExecInitExpr((Expr *) node->grpSetIdFilter,
+ (PlanState *)aggstate);
+
+ /*
* We should now have found all Aggrefs in the targetlist and quals.
*/
numaggs = aggstate->numaggs;
@@ -2348,6 +2557,21 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
/* but the actual Agg node representing this hash is saved here */
perhash->aggnode = aggnode;
+ if (aggnode->rollup)
+ {
+ GroupingSetData *gs =
+ linitial_node(GroupingSetData, aggnode->rollup->gsets_data);
+
+ perhash->grpsetid = gs->grpsetId;
+
+ /* add a mapping when combining */
+ if (DO_AGGSPLIT_COMBINE(aggnode->aggsplit))
+ {
+ aggstate->grpSetMappings[perhash->grpsetid].is_hashed = true;
+ aggstate->grpSetMappings[perhash->grpsetid].index = i;
+ }
+ }
+
phasedata->gset_lengths[i] = perhash->numCols = aggnode->numCols;
for (j = 0; j < aggnode->numCols; ++j)
@@ -2363,18 +2587,21 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
AggStatePerPhase phasedata = &aggstate->phases[++phase];
int num_sets;
- phasedata->numsets = num_sets = list_length(aggnode->groupingSets);
+ phasedata->numsets = num_sets = aggnode->rollup ?
+ list_length(aggnode->rollup->gsets) : 0;
if (num_sets)
{
phasedata->gset_lengths = palloc(num_sets * sizeof(int));
phasedata->grouped_cols = palloc(num_sets * sizeof(Bitmapset *));
+ phasedata->grpsetids = palloc(num_sets * sizeof(int));
i = 0;
- foreach(l, aggnode->groupingSets)
+ foreach(l, aggnode->rollup->gsets_data)
{
- int current_length = list_length(lfirst(l));
Bitmapset *cols = NULL;
+ GroupingSetData *gs = lfirst_node(GroupingSetData, l);
+ int current_length = list_length(gs->set);
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -2382,12 +2609,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->grouped_cols[i] = cols;
phasedata->gset_lengths[i] = current_length;
-
+ phasedata->grpsetids[i] = gs->grpsetId;
++i;
}
all_grouped_cols = bms_add_members(all_grouped_cols,
phasedata->grouped_cols[0]);
+
+ /* add a mapping when combining */
+ if (DO_AGGSPLIT_COMBINE(node->aggsplit))
+ {
+ aggstate->grpSetMappings[phasedata->grpsetids[0]].is_hashed = false;
+ aggstate->grpSetMappings[phasedata->grpsetids[0]].index = phase;
+ }
}
else
{
@@ -2871,23 +3105,50 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
if (!phase->aggnode)
continue;
- if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 1)
+ if (aggstate->aggstrategy == AGG_MIXED &&
+ phaseidx == 1)
{
- /*
- * Phase one, and only phase one, in a mixed agg performs both
- * sorting and aggregation.
- */
- dohash = true;
- dosort = true;
+ if (!DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ /*
+ * Phase one, and only phase one, in a mixed agg performs both
+ * sorting and aggregation.
+ */
+ dohash = true;
+ dosort = true;
+ }
+ else
+ {
+ /*
+ * When combining partial grouping sets aggregate results, the input
+ * is dispatched according to the grouping set id; we cannot perform
+ * both sort and hash aggregation in one phase, so perform only the
+ * sort aggregation here.
+ */
+ dohash = false;
+ dosort = true;
+ }
}
else if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 0)
{
- /*
- * No need to compute a transition function for an AGG_MIXED phase
- * 0 - the contents of the hashtables will have been computed
- * during phase 1.
- */
- continue;
+ if (!DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ /*
+ * No need to compute a transition function for an AGG_MIXED phase
+ * 0 - the contents of the hashtables will have been computed
+ * during phase 1.
+ */
+ continue;
+ }
+ else
+ {
+ /*
+ * When combining partial grouping sets aggregate results, phase 0
+ * needs to do its own hash aggregation.
+ */
+ dohash = true;
+ dosort = false;
+ }
}
else if (phase->aggstrategy == AGG_PLAIN ||
phase->aggstrategy == AGG_SORTED)
@@ -3440,6 +3701,7 @@ ExecReScanAgg(AggState *node)
int setno;
node->agg_done = false;
+ node->input_dispatched = false;
if (node->aggstrategy == AGG_HASHED)
{
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 8034d5a..bd8870f 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -989,7 +989,7 @@ _copyAgg(const Agg *from)
}
COPY_SCALAR_FIELD(numGroups);
COPY_BITMAPSET_FIELD(aggParams);
- COPY_NODE_FIELD(groupingSets);
+ COPY_NODE_FIELD(rollup);
COPY_NODE_FIELD(chain);
return newnode;
@@ -1477,6 +1477,50 @@ _copyGroupingFunc(const GroupingFunc *from)
}
/*
+ * _copyGroupingSetId
+ */
+static GroupingSetId *
+_copyGroupingSetId(const GroupingSetId *from)
+{
+ GroupingSetId *newnode = makeNode(GroupingSetId);
+
+ return newnode;
+}
+
+/*
+ * _copyRollupData
+ */
+static RollupData*
+_copyRollupData(const RollupData *from)
+{
+ RollupData *newnode = makeNode(RollupData);
+
+ COPY_NODE_FIELD(groupClause);
+ COPY_NODE_FIELD(gsets);
+ COPY_NODE_FIELD(gsets_data);
+ COPY_SCALAR_FIELD(numGroups);
+ COPY_SCALAR_FIELD(hashable);
+ COPY_SCALAR_FIELD(is_hashed);
+
+ return newnode;
+}
+
+/*
+ * _copyGroupingSetData
+ */
+static GroupingSetData *
+_copyGroupingSetData(const GroupingSetData *from)
+{
+ GroupingSetData *newnode = makeNode(GroupingSetData);
+
+ COPY_NODE_FIELD(set);
+ COPY_SCALAR_FIELD(grpsetId);
+ COPY_SCALAR_FIELD(numGroups);
+
+ return newnode;
+}
+
+/*
* _copyWindowFunc
*/
static WindowFunc *
@@ -4956,6 +5000,9 @@ copyObjectImpl(const void *from)
case T_GroupingFunc:
retval = _copyGroupingFunc(from);
break;
+ case T_GroupingSetId:
+ retval = _copyGroupingSetId(from);
+ break;
case T_WindowFunc:
retval = _copyWindowFunc(from);
break;
@@ -5589,6 +5636,12 @@ copyObjectImpl(const void *from)
case T_SortGroupClause:
retval = _copySortGroupClause(from);
break;
+ case T_RollupData:
+ retval = _copyRollupData(from);
+ break;
+ case T_GroupingSetData:
+ retval = _copyGroupingSetData(from);
+ break;
case T_GroupingSet:
retval = _copyGroupingSet(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 9c8070c..b904263 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3062,6 +3062,9 @@ equal(const void *a, const void *b)
case T_GroupingFunc:
retval = _equalGroupingFunc(a, b);
break;
+ case T_GroupingSetId:
+ retval = true;
+ break;
case T_WindowFunc:
retval = _equalWindowFunc(a, b);
break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index d85ca9f..877ea0b 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -62,6 +62,9 @@ exprType(const Node *expr)
case T_GroupingFunc:
type = INT4OID;
break;
+ case T_GroupingSetId:
+ type = INT4OID;
+ break;
case T_WindowFunc:
type = ((const WindowFunc *) expr)->wintype;
break;
@@ -740,6 +743,9 @@ exprCollation(const Node *expr)
case T_GroupingFunc:
coll = InvalidOid;
break;
+ case T_GroupingSetId:
+ coll = InvalidOid;
+ break;
case T_WindowFunc:
coll = ((const WindowFunc *) expr)->wincollid;
break;
@@ -1869,6 +1875,7 @@ expression_tree_walker(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
/* primitive node types with no expression subnodes */
break;
case T_WithCheckOption:
@@ -2575,6 +2582,7 @@ expression_tree_mutator(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
return (Node *) copyObject(node);
case T_WithCheckOption:
{
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index a53d473..aed2044 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -784,7 +784,7 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_OID_ARRAY(grpCollations, node->numCols);
WRITE_LONG_FIELD(numGroups);
WRITE_BITMAPSET_FIELD(aggParams);
- WRITE_NODE_FIELD(groupingSets);
+ WRITE_NODE_FIELD(rollup);
WRITE_NODE_FIELD(chain);
}
@@ -1149,6 +1149,13 @@ _outGroupingFunc(StringInfo str, const GroupingFunc *node)
}
static void
+_outGroupingSetId(StringInfo str,
+ const GroupingSetId *node __attribute__((unused)))
+{
+ WRITE_NODE_TYPE("GROUPINGSETID");
+}
+
+static void
_outWindowFunc(StringInfo str, const WindowFunc *node)
{
WRITE_NODE_TYPE("WINDOWFUNC");
@@ -1999,6 +2006,7 @@ _outGroupingSetData(StringInfo str, const GroupingSetData *node)
WRITE_NODE_TYPE("GSDATA");
WRITE_NODE_FIELD(set);
+ WRITE_INT_FIELD(grpsetId);
WRITE_FLOAT_FIELD(numGroups, "%.0f");
}
@@ -3840,6 +3848,9 @@ outNode(StringInfo str, const void *obj)
case T_GroupingFunc:
_outGroupingFunc(str, obj);
break;
+ case T_GroupingSetId:
+ _outGroupingSetId(str, obj);
+ break;
case T_WindowFunc:
_outWindowFunc(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index 81e7b94..056f03a 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -637,6 +637,50 @@ _readGroupingFunc(void)
}
/*
+ * _readGroupingSetId
+ */
+static GroupingSetId *
+_readGroupingSetId(void)
+{
+ READ_LOCALS_NO_FIELDS(GroupingSetId);
+
+ READ_DONE();
+}
+
+/*
+ * _readRollupData
+ */
+static RollupData *
+_readRollupData(void)
+{
+ READ_LOCALS(RollupData);
+
+ READ_NODE_FIELD(groupClause);
+ READ_NODE_FIELD(gsets);
+ READ_NODE_FIELD(gsets_data);
+ READ_FLOAT_FIELD(numGroups);
+ READ_BOOL_FIELD(hashable);
+ READ_BOOL_FIELD(is_hashed);
+
+ READ_DONE();
+}
+
+/*
+ * _readGroupingSetData
+ */
+static GroupingSetData *
+_readGroupingSetData(void)
+{
+ READ_LOCALS(GroupingSetData);
+
+ READ_NODE_FIELD(set);
+ READ_INT_FIELD(grpsetId);
+ READ_FLOAT_FIELD(numGroups);
+
+ READ_DONE();
+}
+
+/*
* _readWindowFunc
*/
static WindowFunc *
@@ -2201,7 +2245,7 @@ _readAgg(void)
READ_OID_ARRAY(grpCollations, local_node->numCols);
READ_LONG_FIELD(numGroups);
READ_BITMAPSET_FIELD(aggParams);
- READ_NODE_FIELD(groupingSets);
+ READ_NODE_FIELD(rollup);
READ_NODE_FIELD(chain);
READ_DONE();
@@ -2637,6 +2681,12 @@ parseNodeString(void)
return_value = _readAggref();
else if (MATCH("GROUPINGFUNC", 12))
return_value = _readGroupingFunc();
+ else if (MATCH("GROUPINGSETID", 13))
+ return_value = _readGroupingSetId();
+ else if (MATCH("ROLLUP", 6))
+ return_value = _readRollupData();
+ else if (MATCH("GSDATA", 6))
+ return_value = _readGroupingSetData();
else if (MATCH("WINDOWFUNC", 10))
return_value = _readWindowFunc();
else if (MATCH("SUBSCRIPTINGREF", 15))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 8286d9c..31f3cdf 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2708,6 +2708,9 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
NULL, rowsp);
add_path(rel, simple_gather_path);
+ if (root->parse->groupingSets)
+ return;
+
/*
* For each useful ordering, we can consider an order-preserving Gather
* Merge.
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index dff826a..a35eaad 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1641,7 +1641,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
groupColIdx,
groupOperators,
groupCollations,
- NIL,
+ NULL,
NIL,
best_path->path.rows,
subplan);
@@ -2093,7 +2093,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
extract_grouping_ops(best_path->groupClause),
extract_grouping_collations(best_path->groupClause,
subplan->targetlist),
- NIL,
+ NULL,
NIL,
best_path->numGroups,
subplan);
@@ -2204,7 +2204,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
* never be grouping in an UPDATE/DELETE; but let's Assert that.
*/
Assert(root->inhTargetKind == INHKIND_NONE);
- Assert(root->grouping_map == NULL);
+ /* Assert(root->grouping_map == NULL); */
root->grouping_map = grouping_map;
/*
@@ -2249,12 +2249,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
NIL,
rollup->numGroups,
sort_plan);
@@ -2287,12 +2287,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
chain,
rollup->numGroups,
subplan);
@@ -6194,7 +6194,7 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
+ RollupData *rollup, List *chain,
double dNumGroups, Plan *lefttree)
{
Agg *node = makeNode(Agg);
@@ -6212,7 +6212,7 @@ make_agg(List *tlist, List *qual,
node->grpCollations = grpCollations;
node->numGroups = numGroups;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
- node->groupingSets = groupingSets;
+ node->rollup = rollup;
node->chain = chain;
plan->qual = qual;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d6f2153..365bf1c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -106,6 +106,7 @@ typedef struct
typedef struct
{
List *rollups;
+ List *final_rollups;
List *hash_sets_idx;
double dNumHashGroups;
bool any_hashable;
@@ -113,6 +114,7 @@ typedef struct
Bitmapset *unhashable_refs;
List *unsortable_sets;
int *tleref_to_colnum_map;
+ int numGroupingSets;
} grouping_sets_data;
/*
@@ -126,6 +128,8 @@ typedef struct
* clauses per Window */
} WindowClauseSortData;
+typedef void (*add_path_callback) (RelOptInfo *parent_rel, Path *new_path);
+
/* Local functions */
static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
@@ -142,7 +146,8 @@ static double preprocess_limit(PlannerInfo *root,
static void remove_useless_groupby_columns(PlannerInfo *root);
static List *preprocess_groupclause(PlannerInfo *root, List *force);
static List *extract_rollup_sets(List *groupingSets);
-static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
+static List *reorder_grouping_sets(grouping_sets_data *gd,
+ List *groupingSets, List *sortclause);
static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
@@ -175,7 +180,10 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ List *havingQual,
+ AggSplit aggsplit);
+
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -2442,6 +2450,8 @@ preprocess_grouping_sets(PlannerInfo *root)
int maxref = 0;
ListCell *lc;
ListCell *lc_set;
+ ListCell *lc_rollup;
+ RollupData *rollup;
grouping_sets_data *gd = palloc0(sizeof(grouping_sets_data));
parse->groupingSets = expand_grouping_sets(parse->groupingSets, -1);
@@ -2493,6 +2503,7 @@ preprocess_grouping_sets(PlannerInfo *root)
GroupingSetData *gs = makeNode(GroupingSetData);
gs->set = gset;
+ gs->grpsetId = gd->numGroupingSets++;
gd->unsortable_sets = lappend(gd->unsortable_sets, gs);
/*
@@ -2524,8 +2535,8 @@ preprocess_grouping_sets(PlannerInfo *root)
foreach(lc_set, sets)
{
List *current_sets = (List *) lfirst(lc_set);
- RollupData *rollup = makeNode(RollupData);
GroupingSetData *gs;
+ rollup = makeNode(RollupData);
/*
* Reorder the current list of grouping sets into correct prefix
@@ -2537,7 +2548,7 @@ preprocess_grouping_sets(PlannerInfo *root)
* largest-member-first, and applies the GroupingSetData annotations,
* though the data will be filled in later.
*/
- current_sets = reorder_grouping_sets(current_sets,
+ current_sets = reorder_grouping_sets(gd, current_sets,
(list_length(sets) == 1
? parse->sortClause
: NIL));
@@ -2589,6 +2600,33 @@ preprocess_grouping_sets(PlannerInfo *root)
gd->rollups = lappend(gd->rollups, rollup);
}
+ /*
+ * Divide each initial rollup into single-grouping-set rollups; these
+ * final_rollups are used when combining partial grouping sets
+ * aggregate results, where each tuple belongs to exactly one set.
+ */
+ foreach(lc_rollup, gd->rollups)
+ {
+ RollupData *initial_rollup = lfirst(lc_rollup);
+
+ foreach(lc, initial_rollup->gsets_data)
+ {
+ GroupingSetData *gs = lfirst(lc);
+ rollup = makeNode(RollupData);
+
+ if (gs->set == NIL)
+ rollup->groupClause = NIL;
+ else
+ rollup->groupClause = preprocess_groupclause(root, gs->set);
+ rollup->gsets_data = list_make1(gs);
+ rollup->gsets = remap_to_groupclause_idx(rollup->groupClause,
+ rollup->gsets_data,
+ gd->tleref_to_colnum_map);
+
+ rollup->numGroups = gs->numGroups;
+ rollup->hashable = initial_rollup->hashable;
+ rollup->is_hashed = initial_rollup->is_hashed;
+
+ gd->final_rollups = lappend(gd->final_rollups, rollup);
+ }
+ }
+
if (gd->unsortable_sets)
{
/*
@@ -3546,7 +3584,7 @@ extract_rollup_sets(List *groupingSets)
* gets implemented in one pass.)
*/
static List *
-reorder_grouping_sets(List *groupingsets, List *sortclause)
+reorder_grouping_sets(grouping_sets_data *gd, List *groupingsets, List *sortclause)
{
ListCell *lc;
List *previous = NIL;
@@ -3580,6 +3618,7 @@ reorder_grouping_sets(List *groupingsets, List *sortclause)
previous = list_concat(previous, new_elems);
gs->set = list_copy(previous);
+ gs->grpsetId = gd->numGroupingSets++;
result = lcons(gs, result);
}
@@ -3730,6 +3769,30 @@ get_number_of_groups(PlannerInfo *root,
dNumGroups += rollup->numGroups;
}
+ foreach(lc, gd->final_rollups)
+ {
+ RollupData *rollup = lfirst_node(RollupData, lc);
+ ListCell *lc;
+
+ groupExprs = get_sortgrouplist_exprs(rollup->groupClause,
+ target_list);
+
+ rollup->numGroups = 0.0;
+
+ forboth(lc, rollup->gsets, lc2, rollup->gsets_data)
+ {
+ List *gset = (List *) lfirst(lc);
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc2);
+ double numGroups = estimate_num_groups(root,
+ groupExprs,
+ path_rows,
+ &gset);
+
+ gs->numGroups = numGroups;
+ rollup->numGroups += numGroups;
+ }
+ }
+
if (gd->hash_sets_idx)
{
ListCell *lc;
@@ -4195,9 +4258,26 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ List *havingQual,
+ AggSplit aggsplit)
{
- Query *parse = root->parse;
+ /* For partial path, add it to partial_pathlist */
+ add_path_callback add_path_cb =
+ (aggsplit == AGGSPLIT_INITIAL_SERIAL) ? add_partial_path : add_path;
+
+ /*
+ * If we are combining partial grouping sets aggregation results, the
+ * input is mixed with tuples from different grouping sets, and the
+ * executor dispatches the tuples to different rollups (phases)
+ * according to the grouping set id.
+ *
+ * We cannot reuse the rollups of the initial stage, in which each
+ * tuple is processed by one or more grouping sets within a rollup,
+ * because in the combining stage each tuple belongs to exactly one
+ * grouping set. Instead we use final_rollups, in which each rollup
+ * contains only one grouping set.
+ */
+ List *rollups = DO_AGGSPLIT_COMBINE(aggsplit) ? gd->final_rollups : gd->rollups;
/*
* If we're not being offered sorted input, then only consider plans that
@@ -4218,7 +4298,7 @@ consider_groupingsets_paths(PlannerInfo *root,
List *empty_sets_data = NIL;
List *empty_sets = NIL;
ListCell *lc;
- ListCell *l_start = list_head(gd->rollups);
+ ListCell *l_start = list_head(rollups);
AggStrategy strat = AGG_HASHED;
double hashsize;
double exclude_groups = 0.0;
@@ -4250,7 +4330,7 @@ consider_groupingsets_paths(PlannerInfo *root,
{
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
- l_start = lnext(gd->rollups, l_start);
+ l_start = lnext(rollups, l_start);
}
hashsize = estimate_hashagg_tablesize(path,
@@ -4258,11 +4338,11 @@ consider_groupingsets_paths(PlannerInfo *root,
dNumGroups - exclude_groups);
/*
- * gd->rollups is empty if we have only unsortable columns to work
+ * rollups is empty if we have only unsortable columns to work
* with. Override work_mem in that case; otherwise, we'll rely on the
* sorted-input case to generate usable mixed paths.
*/
- if (hashsize > work_mem * 1024L && gd->rollups)
+ if (hashsize > work_mem * 1024L && rollups)
return; /* nope, won't fit */
/*
@@ -4271,7 +4351,7 @@ consider_groupingsets_paths(PlannerInfo *root,
*/
sets_data = list_copy(gd->unsortable_sets);
- for_each_cell(lc, gd->rollups, l_start)
+ for_each_cell(lc, rollups, l_start)
{
RollupData *rollup = lfirst_node(RollupData, lc);
@@ -4339,34 +4419,60 @@ consider_groupingsets_paths(PlannerInfo *root,
}
else if (empty_sets)
{
- RollupData *rollup = makeNode(RollupData);
+ /*
+ * If we are combining, each empty set becomes its own rollup;
+ * otherwise, all empty sets are put into a single rollup.
+ */
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ {
+ ListCell *lc2;
+ forboth(lc, empty_sets, lc2, empty_sets_data)
+ {
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc2);
+ RollupData *rollup = makeNode(RollupData);
+
+ rollup->groupClause = NIL;
+ rollup->gsets_data = list_make1(gs);
+ rollup->gsets = list_make1(NIL);
+ rollup->numGroups = 1;
+ rollup->hashable = false;
+ rollup->is_hashed = false;
+ new_rollups = lappend(new_rollups, rollup);
+ }
+ }
+ else
+ {
+ RollupData *rollup = makeNode(RollupData);
+
+ rollup->groupClause = NIL;
+ rollup->gsets_data = empty_sets_data;
+ rollup->gsets = empty_sets;
+ rollup->numGroups = list_length(empty_sets);
+ rollup->hashable = false;
+ rollup->is_hashed = false;
+ new_rollups = lappend(new_rollups, rollup);
+ }
- rollup->groupClause = NIL;
- rollup->gsets_data = empty_sets_data;
- rollup->gsets = empty_sets;
- rollup->numGroups = list_length(empty_sets);
- rollup->hashable = false;
- rollup->is_hashed = false;
- new_rollups = lappend(new_rollups, rollup);
strat = AGG_MIXED;
}
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- strat,
- new_rollups,
- agg_costs,
- dNumGroups));
+ add_path_cb(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ strat,
+ new_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit));
return;
}
/*
* If we have sorted input but nothing we can do with it, bail.
*/
- if (list_length(gd->rollups) == 0)
+ if (list_length(rollups) == 0)
return;
/*
@@ -4379,7 +4485,7 @@ consider_groupingsets_paths(PlannerInfo *root,
*/
if (can_hash && gd->any_hashable)
{
- List *rollups = NIL;
+ List *mixed_rollups = NIL;
List *hash_sets = list_copy(gd->unsortable_sets);
double availspace = (work_mem * 1024.0);
ListCell *lc;
@@ -4391,10 +4497,10 @@ consider_groupingsets_paths(PlannerInfo *root,
agg_costs,
gd->dNumHashGroups);
- if (availspace > 0 && list_length(gd->rollups) > 1)
+ if (availspace > 0 && list_length(rollups) > 1)
{
double scale;
- int num_rollups = list_length(gd->rollups);
+ int num_rollups = list_length(rollups);
int k_capacity;
int *k_weights = palloc(num_rollups * sizeof(int));
Bitmapset *hash_items = NULL;
@@ -4432,11 +4538,13 @@ consider_groupingsets_paths(PlannerInfo *root,
* below, must use the same condition.
*/
i = 0;
- for_each_cell(lc, gd->rollups, list_second_cell(gd->rollups))
+ for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst_node(RollupData, lc);
- if (rollup->hashable)
+ /* An empty grouping set cannot be hashed either */
+ if (rollup->hashable &&
+ list_length(linitial(rollup->gsets)) != 0)
{
double sz = estimate_hashagg_tablesize(path,
agg_costs,
@@ -4463,30 +4571,31 @@ consider_groupingsets_paths(PlannerInfo *root,
if (!bms_is_empty(hash_items))
{
- rollups = list_make1(linitial(gd->rollups));
+ mixed_rollups = list_make1(linitial(rollups));
i = 0;
- for_each_cell(lc, gd->rollups, list_second_cell(gd->rollups))
+ for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst_node(RollupData, lc);
- if (rollup->hashable)
+ if (rollup->hashable &&
+ list_length(linitial(rollup->gsets)) != 0)
{
if (bms_is_member(i, hash_items))
hash_sets = list_concat(hash_sets,
rollup->gsets_data);
else
- rollups = lappend(rollups, rollup);
+ mixed_rollups = lappend(mixed_rollups, rollup);
++i;
}
else
- rollups = lappend(rollups, rollup);
+ mixed_rollups = lappend(mixed_rollups, rollup);
}
}
}
- if (!rollups && hash_sets)
- rollups = list_copy(gd->rollups);
+ if (!mixed_rollups && hash_sets)
+ mixed_rollups = list_copy(rollups);
foreach(lc, hash_sets)
{
@@ -4503,20 +4612,21 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = gs->numGroups;
rollup->hashable = true;
rollup->is_hashed = true;
- rollups = lcons(rollup, rollups);
+ mixed_rollups = lcons(rollup, mixed_rollups);
}
- if (rollups)
+ if (mixed_rollups)
{
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_MIXED,
- rollups,
- agg_costs,
- dNumGroups));
+ add_path_cb(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_MIXED,
+ mixed_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit));
}
}
@@ -4524,15 +4634,16 @@ consider_groupingsets_paths(PlannerInfo *root,
* Now try the simple sorted case.
*/
if (!gd->unsortable_sets)
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_SORTED,
- gd->rollups,
- agg_costs,
- dNumGroups));
+ add_path_cb(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_SORTED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit));
}
/*
@@ -5247,6 +5358,13 @@ make_partial_grouping_target(PlannerInfo *root,
add_new_columns_to_pathtarget(partial_target, non_group_exprs);
+ /*
+ * Add a grouping set id expression so that the combining stage can
+ * tell which grouping set each partial-aggregate tuple belongs to.
+ */
+ if (parse->groupingSets)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ add_new_column_to_pathtarget(partial_target, (Expr *)expr);
+ }
+
/*
* Adjust Aggrefs to put them in partial mode. At this point all Aggrefs
* are at the top level of the target list, so we can just scan the list
@@ -6417,7 +6535,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
consider_groupingsets_paths(root, grouped_rel,
path, true, can_hash,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_SIMPLE);
}
else if (parse->hasAggs)
{
@@ -6484,7 +6604,15 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
-1.0);
}
- if (parse->hasAggs)
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, grouped_rel,
+ path, true, can_hash,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_FINAL_DESERIAL);
+ }
+ else if (parse->hasAggs)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6519,7 +6647,9 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_SIMPLE);
}
else
{
@@ -6562,22 +6692,37 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = partially_grouped_rel->cheapest_total_path;
- hashaggtablesize = estimate_hashagg_tablesize(path,
- agg_final_costs,
- dNumGroups);
+ if (parse->groupingSets)
+ {
+ /*
+ * Try for a hash-only groupingsets path over unsorted input.
+ */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGGSPLIT_FINAL_DESERIAL);
+ }
+ else
+ {
- if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ hashaggtablesize = estimate_hashagg_tablesize(path,
+ agg_final_costs,
+ dNumGroups);
+
+ if (hashaggtablesize < work_mem * 1024L)
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6794,8 +6939,16 @@ create_partial_grouping_paths(PlannerInfo *root,
path,
root->group_pathkeys,
-1.0);
-
- if (parse->hasAggs)
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ path, true, can_hash,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGGSPLIT_INITIAL_SERIAL);
+ }
+ else if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
create_agg_path(root,
partially_grouped_rel,
@@ -6856,26 +7009,39 @@ create_partial_grouping_paths(PlannerInfo *root,
{
double hashaggtablesize;
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_partial_path,
- agg_partial_costs,
- dNumPartialPartialGroups);
-
- /* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
- cheapest_partial_path != NULL)
+ if (parse->groupingSets)
{
- add_partial_path(partially_grouped_rel, (Path *)
- create_agg_path(root,
- partially_grouped_rel,
- cheapest_partial_path,
- partially_grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_INITIAL_SERIAL,
- parse->groupClause,
- NIL,
- agg_partial_costs,
- dNumPartialPartialGroups));
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ cheapest_partial_path,
+ false, true,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGGSPLIT_INITIAL_SERIAL);
+ }
+ else
+ {
+ hashaggtablesize =
+ estimate_hashagg_tablesize(cheapest_partial_path,
+ agg_partial_costs,
+ dNumPartialPartialGroups);
+
+ /* Do the same for partial paths. */
+ if (hashaggtablesize < work_mem * 1024L &&
+ cheapest_partial_path != NULL)
+ {
+ add_partial_path(partially_grouped_rel, (Path *)
+ create_agg_path(root,
+ partially_grouped_rel,
+ cheapest_partial_path,
+ partially_grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ parse->groupClause,
+ NIL,
+ agg_partial_costs,
+ dNumPartialPartialGroups));
+ }
}
}
@@ -6918,6 +7084,9 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
/* Try Gather for unordered paths and Gather Merge for ordered ones. */
generate_gather_paths(root, rel, true);
+ if (root->parse->groupingSets)
+ return;
+
/* Try cheapest partial path + explicit Sort + Gather Merge. */
cheapest_partial_path = linitial(rel->partial_pathlist);
if (!pathkeys_contained_in(root->group_pathkeys,
@@ -6963,11 +7132,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index ebb0a59..c26b304 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -752,6 +752,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
plan->qual = (List *)
convert_combining_aggrefs((Node *) plan->qual,
NULL);
+
+ /*
+ * If this node is combining partial grouping sets aggregation
+ * results, we must add a reference to the GroupingSetId expression
+ * in the targetlist of the child plan node.
+ */
+ if (agg->rollup)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ indexed_tlist *subplan_itlist = build_tlist_index(plan->lefttree->targetlist);
+
+ agg->grpSetIdFilter = fix_upper_expr(root, (Node *)expr,
+ subplan_itlist,
+ OUTER_VAR,
+ rtoffset);
+ }
}
set_upper_references(root, plan, rtoffset);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08ae..8a6d562 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2991,7 +2991,8 @@ create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups)
+ double numGroups,
+ AggSplit aggsplit)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
PathTarget *target = rel->reltarget;
@@ -3009,6 +3010,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->aggsplit = aggsplit;
/*
* Simplify callers by downgrading AGG_SORTED to AGG_PLAIN, and AGG_MIXED
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 0018ffc..f8f8075 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -8017,6 +8017,12 @@ get_rule_expr(Node *node, deparse_context *context,
}
break;
+ case T_GroupingSetId:
+ {
+ appendStringInfoString(buf, "GROUPINGSETID()");
+ }
+ break;
+
case T_WindowFunc:
get_windowfunc_expr((WindowFunc *) node, context);
break;
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 7112558..05c71aa 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -216,6 +216,7 @@ typedef enum ExprEvalOp
EEOP_XMLEXPR,
EEOP_AGGREF,
EEOP_GROUPING_FUNC,
+ EEOP_GROUPING_SET_ID,
EEOP_WINDOW_FUNC,
EEOP_SUBPLAN,
EEOP_ALTERNATIVE_SUBPLAN,
@@ -226,6 +227,7 @@ typedef enum ExprEvalOp
EEOP_AGG_STRICT_INPUT_CHECK_ARGS,
EEOP_AGG_STRICT_INPUT_CHECK_NULLS,
EEOP_AGG_INIT_TRANS,
+ EEOP_AGG_PERHASH_NULL_CHECK,
EEOP_AGG_STRICT_TRANS_CHECK,
EEOP_AGG_PLAIN_TRANS_BYVAL,
EEOP_AGG_PLAIN_TRANS,
@@ -573,6 +575,12 @@ typedef struct ExprEvalStep
List *clauses; /* integer list of column numbers */
} grouping_func;
+ /* for EEOP_GROUPING_SET_ID */
+ struct
+ {
+ AggState *parent; /* parent Agg */
+ } grouping_set_id;
+
/* for EEOP_WINDOW_FUNC */
struct
{
@@ -634,6 +642,17 @@ typedef struct ExprEvalStep
int jumpnull;
} agg_init_trans;
+ /* for EEOP_AGG_PERHASH_NULL_CHECK */
+ struct
+ {
+ AggState *aggstate;
+ AggStatePerTrans pertrans;
+ int setno;
+ int transno;
+ int setoff;
+ int jumpnull;
+ } agg_perhash_null_check;
+
/* for EEOP_AGG_STRICT_TRANS_CHECK */
struct
{
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 2fe82da..465c36b 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -280,6 +280,11 @@ typedef struct AggStatePerPhaseData
Sort *sortnode; /* Sort node for input ordering for phase */
ExprState *evaltrans; /* evaluation of transition functions */
+
+ /* field for parallel grouping sets */
+ int *grpsetids;
+ Tuplesortstate *sort_in; /* sorted input to phases > 1 */
+ Tuplestorestate *store_in; /* sorted input to phases > 1 */
} AggStatePerPhaseData;
/*
@@ -302,8 +307,10 @@ typedef struct AggStatePerHashData
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
Agg *aggnode; /* original Agg node, for numGroups etc. */
-} AggStatePerHashData;
+ /* field for parallel grouping sets */
+ int grpsetid;
+} AggStatePerHashData;
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index eaea1f3..bcbc81b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2019,6 +2019,13 @@ typedef struct GroupState
* expressions and run the aggregate transition functions.
* ---------------------
*/
+/* mapping from grouping set id to perphase or perhash data */
+typedef struct GrpSetMapping
+{
+ bool is_hashed;
+ int index; /* index of aggstate->perhash[] or aggstate->phases[]*/
+} GrpSetMapping;
+
/* these structs are private in nodeAgg.c: */
typedef struct AggStatePerAggData *AggStatePerAgg;
typedef struct AggStatePerTransData *AggStatePerTrans;
@@ -2060,6 +2067,7 @@ typedef struct AggState
Tuplesortstate *sort_in; /* sorted input to phases > 1 */
Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
+ Tuplestorestate *store_in; /* sorted input to phases > 1 */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
AggStatePerGroup *pergroups; /* grouping set indexed array of per-group
* pointers */
@@ -2076,6 +2084,12 @@ typedef struct AggState
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
+
+ /* support for parallel grouping sets */
+ bool input_dispatched;
+ ExprState *grpsetid_filter; /* filter to fetch grouping set id
+ from child targetlist */
+ struct GrpSetMapping *grpSetMappings; /* grpsetid <-> perhash or perphase data */
} AggState;
/* ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index baced7e..31f7cd1 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -153,6 +153,7 @@ typedef enum NodeTag
T_Param,
T_Aggref,
T_GroupingFunc,
+ T_GroupingSetId,
T_WindowFunc,
T_SubscriptingRef,
T_FuncExpr,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be19..6093786 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1675,6 +1675,7 @@ typedef struct GroupingSetData
{
NodeTag type;
List *set; /* grouping set as list of sortgrouprefs */
+ int grpsetId; /* unique grouping set identifier */
double numGroups; /* est. number of result groups */
} GroupingSetData;
@@ -1700,6 +1701,7 @@ typedef struct GroupingSetsPath
AggStrategy aggstrategy; /* basic strategy */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
+ AggSplit aggsplit;
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 32c0d87..f5c9af0 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -20,6 +20,7 @@
#include "nodes/bitmapset.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
+#include "nodes/pathnodes.h"
/* ----------------------------------------------------------------
@@ -815,8 +816,9 @@ typedef struct Agg
long numGroups; /* estimated number of groups in input */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
- List *groupingSets; /* grouping sets to use */
+ RollupData *rollup; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Node *grpSetIdFilter;
} Agg;
/* ----------------
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index eb2cacb..d638b7f 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -350,6 +350,12 @@ typedef struct GroupingFunc
int location; /* token location */
} GroupingFunc;
+/*
+ * GroupingSetId
+ *
+ * Expression node that evaluates to the ID of the grouping set that the
+ * current input tuple belongs to; used for parallel grouping sets.
+ */
+typedef struct GroupingSetId
+{
+ Expr xpr;
+} GroupingSetId;
+
/*
* WindowFunc
*/
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe1..e4fd3b1 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,7 +217,8 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups);
+ double numGroups,
+ AggSplit aggsplit);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
PathTarget *target,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index eab486a..95a739a 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,7 +54,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain,
+ RollupData *rollup, List *chain,
double dNumGroups, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
--
2.7.4
I realized that there are two patches in this thread, implemented with
different methods, which causes confusion. So I decided to update this
thread with only one patch, i.e. the patch for 'Implementation 1' as
described in the first email, and to move the other patch to a separate
thread.
With this idea, here is the patch for 'Implementation 1', rebased
against the latest master.
Thanks
Richard
On Wed, Jan 8, 2020 at 3:24 PM Richard Guo <riguo@pivotal.io> wrote:
On Sun, Dec 1, 2019 at 10:03 AM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Nov 28, 2019 at 07:07:22PM +0800, Pengzhou Tang wrote:
>> Richard pointed out that he got incorrect results with the patch I
>> attached; there were bugs somewhere. I fixed them now and attached the
>> newest version; please refer to [1] for the fix.
>
> Mr Robot is reporting that the latest patch fails to build at least on
> Windows. Could you please send a rebase? I have moved the patch to the
> next CF for now, waiting on author.

Thanks for reporting this issue. Here is the rebase.
Thanks
Richard
Attachments:
v5-0001-Implementing-parallel-grouping-sets.patch (application/octet-stream)
From dda1cc13310c2c4efb402114951b54087bf04de8 Mon Sep 17 00:00:00 2001
From: Richard Guo <riguo@pivotal.io>
Date: Sun, 19 Jan 2020 12:18:29 +0000
Subject: [PATCH] Implementing parallel grouping sets.
Parallel aggregation has already been supported in PostgreSQL and it is
implemented by aggregating in two stages. First, each worker performs an
aggregation step, producing a partial result for each group of which
that process is aware. Second, the partial results are transferred to
the leader via the Gather node. Finally, the leader merges the partial
results and produces the final result for each group.
We are implementing parallel grouping sets in the same way. The only
difference is that in the final stage, the leader performs a grouping
sets aggregation, rather than a normal aggregation.
---
src/backend/optimizer/plan/createplan.c | 4 +-
src/backend/optimizer/plan/planner.c | 137 ++++++++++----
src/backend/optimizer/util/pathnode.c | 2 +
src/include/nodes/pathnodes.h | 1 +
src/include/optimizer/pathnode.h | 1 +
.../regress/expected/groupingsets_parallel.out | 201 +++++++++++++++++++++
src/test/regress/parallel_schedule | 1 +
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/groupingsets_parallel.sql | 50 +++++
9 files changed, 363 insertions(+), 35 deletions(-)
create mode 100644 src/test/regress/expected/groupingsets_parallel.out
create mode 100644 src/test/regress/sql/groupingsets_parallel.sql
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index dff826a..2181965 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2249,7 +2249,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
@@ -2287,7 +2287,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index d6f2153..fb60454 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -147,7 +147,8 @@ static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
grouping_sets_data *gd,
- List *target_list);
+ List *target_list,
+ bool is_partial);
static RelOptInfo *create_grouping_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *target,
@@ -175,7 +176,8 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ AggSplit aggsplit);
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -3676,6 +3678,7 @@ standard_qp_callback(PlannerInfo *root, void *extra)
* path_rows: number of output rows from scan/join step
* gd: grouping sets data including list of grouping sets and their clauses
* target_list: target list containing group clause references
+ * is_partial: whether we are estimating for the partial aggregation stage
*
* If doing grouping sets, we also annotate the gsets data with the estimates
* for each set and each individual rollup list, with a view to later
@@ -3685,7 +3688,8 @@ static double
get_number_of_groups(PlannerInfo *root,
double path_rows,
grouping_sets_data *gd,
- List *target_list)
+ List *target_list,
+ bool is_partial)
{
Query *parse = root->parse;
double dNumGroups;
@@ -3694,7 +3698,15 @@ get_number_of_groups(PlannerInfo *root,
{
List *groupExprs;
- if (parse->groupingSets)
+ /*
+ * Grouping sets
+ *
+ * If we are doing partial aggregation for grouping sets, we must
+ * estimate the number of groups based on all the columns in
+ * parse->groupClause. Otherwise, we can add up the estimates for
+ * each grouping set.
+ */
+ if (parse->groupingSets && !is_partial)
{
/* Add up the estimates for each grouping set */
ListCell *lc;
@@ -3757,7 +3769,7 @@ get_number_of_groups(PlannerInfo *root,
}
else
{
- /* Plain GROUP BY */
+ /* Plain GROUP BY, or grouping is in partial aggregate */
groupExprs = get_sortgrouplist_exprs(parse->groupClause,
target_list);
@@ -4150,7 +4162,8 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
dNumGroups = get_number_of_groups(root,
cheapest_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ false);
/* Build final grouping paths */
add_paths_to_grouping_rel(root, input_rel, grouped_rel,
@@ -4195,7 +4208,8 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ AggSplit aggsplit)
{
Query *parse = root->parse;
@@ -4357,6 +4371,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
strat,
+ aggsplit,
new_rollups,
agg_costs,
dNumGroups));
@@ -4514,6 +4529,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_MIXED,
+ aggsplit,
rollups,
agg_costs,
dNumGroups));
@@ -4530,6 +4546,7 @@ consider_groupingsets_paths(PlannerInfo *root,
path,
(List *) parse->havingQual,
AGG_SORTED,
+ aggsplit,
gd->rollups,
agg_costs,
dNumGroups));
@@ -5204,7 +5221,15 @@ make_partial_grouping_target(PlannerInfo *root,
foreach(lc, grouping_target->exprs)
{
Expr *expr = (Expr *) lfirst(lc);
- Index sgref = get_pathtarget_sortgroupref(grouping_target, i);
+ Index sgref = get_pathtarget_sortgroupref(grouping_target, i++);
+
+ /*
+ * GroupingFunc does not need to be evaluated in Partial Aggregate,
+ * since Partial Aggregate will not handle multiple grouping sets at
+ * once.
+ */
+ if (IsA(expr, GroupingFunc))
+ continue;
if (sgref && parse->groupClause &&
get_sortgroupref_clause_noerr(sgref, parse->groupClause) != NULL)
@@ -5223,8 +5248,6 @@ make_partial_grouping_target(PlannerInfo *root,
*/
non_group_cols = lappend(non_group_cols, expr);
}
-
- i++;
}
/*
@@ -6417,7 +6440,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
consider_groupingsets_paths(root, grouped_rel,
path, true, can_hash,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else if (parse->hasAggs)
{
@@ -6484,7 +6507,14 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
-1.0);
}
- if (parse->hasAggs)
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, true, can_hash,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else if (parse->hasAggs)
add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
@@ -6519,7 +6549,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups, AGGSPLIT_SIMPLE);
}
else
{
@@ -6567,17 +6597,27 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
dNumGroups);
if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ {
+ /*
+ * parallel grouping sets
+ */
+ if (parse->groupingSets)
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups, AGGSPLIT_FINAL_DESERIAL);
+ else
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6717,13 +6757,15 @@ create_partial_grouping_paths(PlannerInfo *root,
get_number_of_groups(root,
cheapest_total_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ true);
if (cheapest_partial_path != NULL)
dNumPartialPartialGroups =
get_number_of_groups(root,
cheapest_partial_path->rows,
gd,
- extra->targetList);
+ extra->targetList,
+ true);
if (can_sort && cheapest_total_path != NULL)
{
@@ -6745,11 +6787,28 @@ create_partial_grouping_paths(PlannerInfo *root,
{
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
+ {
+ List *pathkeys;
+
+ /*
+ * If we are performing Partial Aggregate for grouping
+ * sets, we need to sort by all the columns in
+ * parse->groupClause.
+ */
+ if (parse->groupingSets)
+ pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ parse->groupClause,
+ root->processed_tlist);
+ else
+ pathkeys = root->group_pathkeys;
+
path = (Path *) create_sort_path(root,
partially_grouped_rel,
path,
- root->group_pathkeys,
+ pathkeys,
-1.0);
+ }
if (parse->hasAggs)
add_path(partially_grouped_rel, (Path *)
@@ -6789,11 +6848,28 @@ create_partial_grouping_paths(PlannerInfo *root,
{
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
+ {
+ List *pathkeys;
+
+ /*
+ * If we are performing Partial Aggregate for grouping
+ * sets, we need to sort by all the columns in
+ * parse->groupClause.
+ */
+ if (parse->groupingSets)
+ pathkeys =
+ make_pathkeys_for_sortclauses(root,
+ parse->groupClause,
+ root->processed_tlist);
+ else
+ pathkeys = root->group_pathkeys;
+
path = (Path *) create_sort_path(root,
partially_grouped_rel,
path,
- root->group_pathkeys,
+ pathkeys,
-1.0);
+ }
if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
@@ -6963,11 +7039,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index e6d08ae..8d93f63 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2989,6 +2989,7 @@ create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups)
@@ -3034,6 +3035,7 @@ create_groupingsets_path(PlannerInfo *root,
pathnode->path.pathkeys = NIL;
pathnode->aggstrategy = aggstrategy;
+ pathnode->aggsplit = aggsplit;
pathnode->rollups = rollups;
pathnode->qual = having_qual;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3d3be19..9094b69 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1698,6 +1698,7 @@ typedef struct GroupingSetsPath
Path path;
Path *subpath; /* path representing input source */
AggStrategy aggstrategy; /* basic strategy */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
} GroupingSetsPath;
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe1..69d46cf 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -215,6 +215,7 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
Path *subpath,
List *having_qual,
AggStrategy aggstrategy,
+ AggSplit aggsplit,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups);
diff --git a/src/test/regress/expected/groupingsets_parallel.out b/src/test/regress/expected/groupingsets_parallel.out
new file mode 100644
index 0000000..9151960
--- /dev/null
+++ b/src/test/regress/expected/groupingsets_parallel.out
@@ -0,0 +1,201 @@
+--
+-- parallel grouping sets
+--
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int) with (parallel_workers = 4);
+create table gstest1(c1 int, c2 int, c3 int);
+insert into gstest select 1,10,100 from generate_series(1,10)i;
+insert into gstest select 1,10,200 from generate_series(1,10)i;
+insert into gstest select 1,20,30 from generate_series(1,10)i;
+insert into gstest select 2,30,40 from generate_series(1,10)i;
+insert into gstest select 2,40,50 from generate_series(1,10)i;
+insert into gstest select 3,50,60 from generate_series(1,10)i;
+insert into gstest select 1,NULL,0 from generate_series(1,10)i;
+analyze gstest;
+insert into gstest1 select a,b,1 from generate_series(1,100) a, generate_series(1,100) b;
+analyze gstest1;
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+-- negative case
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest1 group by grouping sets((c1),(c2));
+ QUERY PLAN
+----------------------------------
+ HashAggregate
+ Output: c1, c2, avg(c3)
+ Hash Key: gstest1.c1
+ Hash Key: gstest1.c2
+ -> Seq Scan on public.gstest1
+ Output: c1, c2, c3
+(6 rows)
+
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ QUERY PLAN
+------------------------------------------------------------
+ Sort
+ Output: c1, c2, (avg(c3))
+ Sort Key: gstest.c1, gstest.c2, (avg(gstest.c3))
+ -> Finalize HashAggregate
+ Output: c1, c2, avg(c3)
+ Hash Key: gstest.c1, gstest.c2
+ Hash Key: gstest.c1
+ -> Gather
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(15 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.00000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ QUERY PLAN
+----------------------------------------------------------------
+ Sort
+ Output: c1, c2, c3, (avg(c3))
+ Sort Key: gstest.c1, gstest.c2, gstest.c3, (avg(gstest.c3))
+ -> Finalize HashAggregate
+ Output: c1, c2, c3, avg(c3)
+ Hash Key: gstest.c1, gstest.c2
+ Hash Key: gstest.c1
+ Hash Key: gstest.c2, gstest.c3
+ -> Gather
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial HashAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(16 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.00000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.00000000000000000000
+(16 rows)
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ QUERY PLAN
+------------------------------------------------------------------
+ Sort
+ Output: c1, c2, (avg(c3))
+ Sort Key: gstest.c1, gstest.c2, (avg(gstest.c3))
+ -> Finalize GroupAggregate
+ Output: c1, c2, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ -> Gather Merge
+ Output: c1, c2, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(18 rows)
+
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+ c1 | c2 | avg
+----+----+------------------------
+ 1 | 10 | 150.0000000000000000
+ 1 | 20 | 30.0000000000000000
+ 1 | | 0.00000000000000000000
+ 1 | | 82.5000000000000000
+ 2 | 30 | 40.0000000000000000
+ 2 | 40 | 50.0000000000000000
+ 2 | | 45.0000000000000000
+ 3 | 50 | 60.0000000000000000
+ 3 | | 60.0000000000000000
+(9 rows)
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ QUERY PLAN
+---------------------------------------------------------------------
+ Sort
+ Output: c1, c2, c3, (avg(c3))
+ Sort Key: gstest.c1, gstest.c2, gstest.c3, (avg(gstest.c3))
+ -> Finalize GroupAggregate
+ Output: c1, c2, c3, avg(c3)
+ Group Key: gstest.c1, gstest.c2
+ Group Key: gstest.c1
+ Sort Key: gstest.c2, gstest.c3
+ Group Key: gstest.c2, gstest.c3
+ -> Gather Merge
+ Output: c1, c2, c3, (PARTIAL avg(c3))
+ Workers Planned: 4
+ -> Partial GroupAggregate
+ Output: c1, c2, c3, PARTIAL avg(c3)
+ Group Key: gstest.c1, gstest.c2, gstest.c3
+ -> Sort
+ Output: c1, c2, c3
+ Sort Key: gstest.c1, gstest.c2, gstest.c3
+ -> Parallel Seq Scan on public.gstest
+ Output: c1, c2, c3
+(20 rows)
+
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+ c1 | c2 | c3 | avg
+----+----+-----+------------------------
+ 1 | 10 | | 150.0000000000000000
+ 1 | 20 | | 30.0000000000000000
+ 1 | | | 0.00000000000000000000
+ 1 | | | 82.5000000000000000
+ 2 | 30 | | 40.0000000000000000
+ 2 | 40 | | 50.0000000000000000
+ 2 | | | 45.0000000000000000
+ 3 | 50 | | 60.0000000000000000
+ 3 | | | 60.0000000000000000
+ | 10 | 100 | 100.0000000000000000
+ | 10 | 200 | 200.0000000000000000
+ | 20 | 30 | 30.0000000000000000
+ | 30 | 40 | 40.0000000000000000
+ | 40 | 50 | 50.0000000000000000
+ | 50 | 60 | 60.0000000000000000
+ | | 0 | 0.00000000000000000000
+(16 rows)
+
+drop table gstest;
+drop table gstest1;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index d33a4e1..8e18484 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -88,6 +88,7 @@ test: rules psql psql_crosstab amutils stats_ext collate.linux.utf8
# run by itself so it can run parallel workers
test: select_parallel
test: write_parallel
+test: groupingsets_parallel
# no relation related tests can be put in this group
test: publication subscription
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index f86f5c5..36ee9db 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -142,6 +142,7 @@ test: stats_ext
test: collate.linux.utf8
test: select_parallel
test: write_parallel
+test: groupingsets_parallel
test: publication
test: subscription
test: select_views
diff --git a/src/test/regress/sql/groupingsets_parallel.sql b/src/test/regress/sql/groupingsets_parallel.sql
new file mode 100644
index 0000000..fd71920
--- /dev/null
+++ b/src/test/regress/sql/groupingsets_parallel.sql
@@ -0,0 +1,50 @@
+--
+-- parallel grouping sets
+--
+
+-- test data sources
+create table gstest(c1 int, c2 int, c3 int) with (parallel_workers = 4);
+create table gstest1(c1 int, c2 int, c3 int);
+
+insert into gstest select 1,10,100 from generate_series(1,10)i;
+insert into gstest select 1,10,200 from generate_series(1,10)i;
+insert into gstest select 1,20,30 from generate_series(1,10)i;
+insert into gstest select 2,30,40 from generate_series(1,10)i;
+insert into gstest select 2,40,50 from generate_series(1,10)i;
+insert into gstest select 3,50,60 from generate_series(1,10)i;
+insert into gstest select 1,NULL,0 from generate_series(1,10)i;
+analyze gstest;
+
+insert into gstest1 select a,b,1 from generate_series(1,100) a, generate_series(1,100) b;
+analyze gstest1;
+
+SET parallel_tuple_cost=0;
+SET parallel_setup_cost=0;
+SET max_parallel_workers_per_gather=4;
+
+-- negative case
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest1 group by grouping sets((c1),(c2));
+
+-- test for hashagg
+set enable_hashagg to on;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+
+-- test for groupagg
+set enable_hashagg to off;
+explain (costs off, verbose)
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+select c1, c2, avg(c3) from gstest group by grouping sets((c1,c2),(c1)) order by 1,2,3;
+
+explain (costs off, verbose)
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3)) order by 1,2,3,4;
+
+drop table gstest;
+drop table gstest1;
--
2.7.4
On Sun, Jan 19, 2020 at 2:23 PM Richard Guo <riguo@pivotal.io> wrote:
> I realized that there are two patches in this thread that are
> implemented according to different methods, which causes confusion.

Both the ideas seem to be different. Is the second approach [1]
inferior for any case as compared to the first approach? Can we keep
both approaches for parallel grouping sets, if so how? If not, then
won't the code from the first approach be useless once we commit the
second approach?
[1]: /messages/by-id/CAN_9JTwtTTnxhbr5AHuqVcriz3HxvPpx1JWE--DCSdJYuHrLtA@mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jan 23, 2020 at 2:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Jan 19, 2020 at 2:23 PM Richard Guo <riguo@pivotal.io> wrote:
>> I realized that there are two patches in this thread that are
>> implemented according to different methods, which causes confusion.
>
> Both the ideas seem to be different. Is the second approach [1]
> inferior for any case as compared to the first approach? Can we keep
> both approaches for parallel grouping sets, if so how? If not, then
> won't the code from the first approach be useless once we commit the
> second approach?
>
> [1] - /messages/by-id/CAN_9JTwtTTnxhbr5AHuqVcriz3HxvPpx1JWE--DCSdJYuHrLtA@mail.gmail.com
I glanced over both patches. Just the opposite, I have a hunch that v3
is always better than v5. Here's my 6-minute understanding of both.
v5 (the one with a simple partial aggregate) works by pushing a little
bit of partial aggregation onto the workers, and performing a grouping
aggregate above the Gather. This has two interesting outcomes: we can execute
unmodified partial aggregate on the workers, and execute almost
unmodified rollup aggreegate once the trans values are gathered. A
parallel plan for a query like
SELECT count(*) FROM foo GROUP BY GROUPING SETS (a), (b), (c), ();
can be
Finalize GroupAggregate
Output: count(*)
Group Key: a
Group Key: b
Group Key: c
Group Key: ()
Gather Merge
Partial GroupAggregate
Output: PARTIAL count(*)
Group Key: a, b, c
Sort
Sort Key: a, b, c
Parallel Seq Scan on foo
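As a rough illustration of this two-stage scheme, here is a Python sketch with made-up data (not the executor's actual code): workers partially aggregate over the union key (a, b, c), and the leader then rolls those partials up into each grouping set.

```python
from collections import Counter
from itertools import product

# Toy data: rows with columns a, b, c (values invented for illustration).
rows = list(product(range(2), range(2), range(3))) * 5   # 60 rows

# Partial stage (per worker): count(*) grouped by the union key (a, b, c).
workers = [rows[0::2], rows[1::2]]           # pretend parallel scan split
partials = Counter()
for chunk in workers:
    partials.update(Counter(chunk))          # PARTIAL count(*) per (a, b, c)

# Final stage (leader): roll the partials up into each grouping set.
def finalize(partials, keyfunc):
    out = Counter()
    for (a, b, c), n in partials.items():
        out[keyfunc(a, b, c)] += n           # combine trans values
    return out

by_a = finalize(partials, lambda a, b, c: a)      # grouping set (a)
by_none = finalize(partials, lambda a, b, c: ())  # grouping set ()
```

The key point the sketch shows: the workers only ever group by the full union key, and it is the leader's finalize step that projects those partials down to each individual grouping set.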
v3 ("the one with grouping set id") really turns the plan from a tree to
a multiplexed pipe: we can execute grouping aggregate on the workers,
but only partially. When we emit the trans values, also tag the tuple
with a group id. After gather, finalize the aggregates with a modified
grouping aggregate. Unlike a non-split grouping aggregate, the finalize
grouping aggregate does not "flow" the results from one rollup to the
next one. Instead, each group only advances on partial inputs tagged for
the group.
Finalize HashAggregate
Output: count(*)
Dispatched by: (GroupingSetID())
Group Key: a
Group Key: b
Group Key: c
Gather
Partial GroupAggregate
Output: PARTIAL count(*), GroupingSetID()
Group Key: a
Sort Key: b
Group Key: b
Sort Key: c
Group Key: c
Sort
Sort Key: a
Parallel Seq Scan on foo
Note that for the first approach to be viable, the partial aggregate
*has to* use a group key that's the union of all grouping sets. In cases
where individual columns have a low cardinality but joint cardinality is
high (say columns a, b, c each has 16 distinct values, but they are
independent, so there are 4096 distinct values on (a,b,c)), this results
in fairly high traffic through the shm tuple queue.
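Quick back-of-the-envelope arithmetic for this example (illustrative numbers only, taken from the 16-distinct-values scenario above):

```python
# Columns a, b, c each with 16 independent distinct values.
distinct_per_col = 16
union_groups = distinct_per_col ** 3       # v5: partial agg groups on (a, b, c)
per_set_groups = 3 * distinct_per_col + 1  # v3: groups for (a), (b), (c), ()

# v5 pushes roughly one partial tuple per (a, b, c) group from each worker
# through the shm tuple queue; v3 pushes roughly one per group of each
# individual grouping set.
print(union_groups, per_set_groups)
```

So in this scenario each worker sends on the order of 4096 partial tuples under the first approach versus about 49 under the second, which is the "fairly high traffic" concern.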
Cheers,
Jesse
On Sat, Jan 25, 2020 at 4:22 AM Jesse Zhang <sbjesse@gmail.com> wrote:
On Thu, Jan 23, 2020 at 2:47 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Jan 19, 2020 at 2:23 PM Richard Guo <riguo@pivotal.io> wrote:
I realized that there are two patches in this thread that are
implemented according to different methods, which causes confusion. Both
the ideas seem to be different. Is the second approach [1]
inferior for any case as compared to the first approach? Can we keep
both approaches for parallel grouping sets, if so how? If not, then
won't the code by the first approach be useless once we commit second
approach?
[1]: /messages/by-id/CAN_9JTwtTTnxhbr5AHuqVcriz3HxvPpx1JWE--DCSdJYuHrLtA@mail.gmail.com
I glanced over both patches. Just the opposite, I have a hunch that v3
is always better than v5.
This is what I also understood after reading this thread. So, my
question is why not just review v3 and commit something on those lines
even though it would take a bit more time. It is possible that if we
decide to go with v5, we can make it happen earlier, but later when we
try to get v3, the code committed as part of v5 might not be of any
use or if it is useful, then in which cases?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi Jesse,
Thanks for reviewing these two patches.
On Sat, Jan 25, 2020 at 6:52 AM Jesse Zhang <sbjesse@gmail.com> wrote:
I glanced over both patches. Just the opposite, I have a hunch that v3
is always better than v5. Here's my 6-minute understanding of both.
v5 (the one with a simple partial aggregate) works by pushing a little
bit of partial aggregation onto the workers, and performing a grouping
aggregate above the gather. This has two interesting outcomes: we can
execute an unmodified partial aggregate on the workers, and execute an
almost unmodified rollup aggregate once the trans values are gathered. A
parallel plan for a query like
SELECT count(*) FROM foo GROUP BY GROUPING SETS (a), (b), (c), ();
can be
Finalize GroupAggregate
Output: count(*)
Group Key: a
Group Key: b
Group Key: c
Group Key: ()
Gather Merge
Partial GroupAggregate
Output: PARTIAL count(*)
Group Key: a, b, c
Sort
Sort Key: a, b, c
Parallel Seq Scan on foo
Yes, this is the idea of v5 patch.
v3 ("the one with grouping set id") really turns the plan from a tree to
a multiplexed pipe: we can execute grouping aggregate on the workers,
but only partially. When we emit the trans values, also tag the tuple
with a group id. After gather, finalize the aggregates with a modified
grouping aggregate. Unlike a non-split grouping aggregate, the finalize
grouping aggregate does not "flow" the results from one rollup to the
next one. Instead, each group only advances on partial inputs tagged for
the group.
Finalize HashAggregate
Output: count(*)
Dispatched by: (GroupingSetID())
Group Key: a
Group Key: b
Group Key: c
Gather
Partial GroupAggregate
Output: PARTIAL count(*), GroupingSetID()
Group Key: a
Sort Key: b
Group Key: b
Sort Key: c
Group Key: c
Sort
Sort Key: a
Parallel Seq Scan on foo
Yes, this is what v3 patch does.
We (Pengzhou and I) had an offline discussion on this plan and we have
some other idea. Since we have tagged 'GroupingSetId' for each tuple
produced by partial aggregate, why not then perform a normal grouping
sets aggregation in the final phase, with the 'GroupingSetId' included
in the group keys? The plan looks like:
# explain (costs off, verbose)
select c1, c2, c3, avg(c3) from gstest group by grouping
sets((c1,c2),(c1),(c2,c3));
QUERY PLAN
------------------------------------------------------------------
Finalize GroupAggregate
Output: c1, c2, c3, avg(c3)
Group Key: (gset_id), gstest.c1, gstest.c2, gstest.c3
-> Sort
Output: c1, c2, c3, (gset_id), (PARTIAL avg(c3))
Sort Key: (gset_id), gstest.c1, gstest.c2, gstest.c3
-> Gather
Output: c1, c2, c3, (gset_id), (PARTIAL avg(c3))
Workers Planned: 4
-> Partial HashAggregate
Output: c1, c2, c3, gset_id, PARTIAL avg(c3)
Hash Key: gstest.c1, gstest.c2
Hash Key: gstest.c1
Hash Key: gstest.c2, gstest.c3
-> Parallel Seq Scan on public.gstest
Output: c1, c2, c3
This plan should be able to give the correct results. We are still
thinking about whether it is a better plan than the 'multiplexed pipe'
plan as in v3. Thoughts here would be appreciated.
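A miniature model of this idea as a Python sketch (hypothetical; gset_id and the tuple layout here are illustrative, not the patch's actual structures): the final stage is just a normal aggregate keyed on (gset_id, grouping columns).

```python
from collections import Counter

# Toy rows of (c1, c2, c3); values invented for illustration.
rows = [(1, 'x', 10), (1, 'y', 10), (2, 'x', 20), (1, 'x', 5)]
gsets = [lambda r: (r[0], r[1]),   # set 0: (c1, c2)
         lambda r: (r[0],),        # set 1: (c1)
         lambda r: (r[1], r[2])]   # set 2: (c2, c3)

# Partial stage: each worker runs the full grouping sets over its chunk
# and tags every partial result with its grouping set id.
def partial_agg(chunk):
    p = Counter()
    for r in chunk:
        for gset_id, key in enumerate(gsets):
            p[(gset_id, key(r))] += r[2]       # PARTIAL sum(c3)
    return p

# Final stage: a normal aggregate with (gset_id, key) as the group key;
# gset_id keeps equal-looking keys from different sets apart.
final = Counter()
for worker_chunk in (rows[0::2], rows[1::2]):  # pretend parallel split
    final.update(partial_agg(worker_chunk))
```

The point is that no per-set dispatch logic is needed in the final stage: including gset_id in the key is enough to disambiguate partials from different grouping sets.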
Note that for the first approach to be viable, the partial aggregate
*has to* use a group key that's the union of all grouping sets. In cases
where individual columns have a low cardinality but joint cardinality is
high (say columns a, b, c each has 16 distinct values, but they are
independent, so there are 4096 distinct values on (a,b,c)), this results
in fairly high traffic through the shm tuple queue.
Yes, you are right. This is the case mentioned by David earlier in [1].
In this case, ideally the parallel plan would lose when competing with
the non-parallel plan in add_path() and so not be chosen.
[1]: /messages/by-id/CAKJS1f8Q9muALhkapbnO3bPUgAmZkWq9tM_crk8o9=JiiOPWsg@mail.gmail.com
Thanks
Richard
Hi Amit,
Thanks for reviewing these two patches.
On Sat, Jan 25, 2020 at 6:31 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
This is what I also understood after reading this thread. So, my
question is why not just review v3 and commit something on those lines
even though it would take a bit more time. It is possible that if we
decide to go with v5, we can make it happen earlier, but later when we
try to get v3, the code committed as part of v5 might not be of any
use or if it is useful, then in which cases?
Yes, approach #2 (v3) would be generally better than approach #1 (v5) in
performance. I started with approach #1 because it is much easier.
If we decide to go with approach #2, I think we can now concentrate on
v3 patch.
For v3 patch, we have some other idea, which is to perform a normal
grouping sets aggregation in the final phase, with 'GroupingSetId'
included in the group keys (as described in the previous email). With
this idea, we can avoid a lot of hacky codes in current v3 patch.
Thanks
Richard
On Mon, Feb 3, 2020 at 12:07 AM Richard Guo <riguo@pivotal.io> wrote:
Hi Jesse,
Thanks for reviewing these two patches.
I enjoyed it!
On Sat, Jan 25, 2020 at 6:52 AM Jesse Zhang <sbjesse@gmail.com> wrote:
I glanced over both patches. Just the opposite, I have a hunch that v3
is always better than v5. Here's my 6-minute understanding of both.
v3 ("the one with grouping set id") really turns the plan from a tree to
a multiplexed pipe: we can execute grouping aggregate on the workers,
but only partially. When we emit the trans values, also tag the tuple
with a group id. After gather, finalize the aggregates with a modified
grouping aggregate. Unlike a non-split grouping aggregate, the finalize
grouping aggregate does not "flow" the results from one rollup to the
next one. Instead, each group only advances on partial inputs tagged for
the group.
Yes, this is what the v3 patch does.
We (Pengzhou and I) had an offline discussion on this plan and we have
some other idea. Since we have tagged 'GroupingSetId' for each tuple
produced by partial aggregate, why not then perform a normal grouping
sets aggregation in the final phase, with the 'GroupingSetId' included
in the group keys? The plan looks like:
# explain (costs off, verbose)
select c1, c2, c3, avg(c3) from gstest group by grouping sets((c1,c2),(c1),(c2,c3));
QUERY PLAN
------------------------------------------------------------------
Finalize GroupAggregate
Output: c1, c2, c3, avg(c3)
Group Key: (gset_id), gstest.c1, gstest.c2, gstest.c3
-> Sort
Output: c1, c2, c3, (gset_id), (PARTIAL avg(c3))
Sort Key: (gset_id), gstest.c1, gstest.c2, gstest.c3
-> Gather
Output: c1, c2, c3, (gset_id), (PARTIAL avg(c3))
Workers Planned: 4
-> Partial HashAggregate
Output: c1, c2, c3, gset_id, PARTIAL avg(c3)
Hash Key: gstest.c1, gstest.c2
Hash Key: gstest.c1
Hash Key: gstest.c2, gstest.c3
-> Parallel Seq Scan on public.gstest
Output: c1, c2, c3
This plan should be able to give the correct results. We are still
thinking if it is a better plan than the 'multiplexed pipe' plan as in
v3. Inputs of thoughts here would be appreciated.
Ha, I believe you meant to say a "normal aggregate", because what's
performed above gather is no longer "grouping sets", right?
The group key idea is clever in that it helps "discriminate" tuples by
their grouping set id. I haven't completely thought this through, but my
hunch is that this leaves some money on the table, for example, won't it
also lead to more expensive (and unnecessary) sorting and hashing? The
groupings with a few partials are now sharing the same tuplesort with
the groupings with a lot of groups even though we only want to tell
grouping 1 *apart from* grouping 10, not necessarily that grouping 1
needs to come before grouping 10. That's why I like the multiplexed pipe
/ "dispatched by grouping set id" idea: we only pay for sorting (or
hashing) within each grouping. That said, I'm open to the criticism that
keeping multiple tuplesort and agg hash tables running is expensive in
itself, memory-wise ...
Cheers,
Jesse
Thanks for reviewing those patches.
Ha, I believe you meant to say a "normal aggregate", because what's
performed above gather is no longer "grouping sets", right?
The group key idea is clever in that it helps "discriminate" tuples by
their grouping set id. I haven't completely thought this through, but my
hunch is that this leaves some money on the table, for example, won't it
also lead to more expensive (and unnecessary) sorting and hashing? The
groupings with a few partials are now sharing the same tuplesort with
the groupings with a lot of groups even though we only want to tell
grouping 1 *apart from* grouping 10, not necessarily that grouping 1
needs to come before grouping 10. That's why I like the multiplexed pipe
/ "dispatched by grouping set id" idea: we only pay for sorting (or
hashing) within each grouping. That said, I'm open to the criticism that
keeping multiple tuplesort and agg hash tables running is expensive in
itself, memory-wise ...
Cheers,
Jesse
That's something we need to test, thanks. Meanwhile, for the approach
that uses a "normal aggregate" with the grouping set id, one concern is
that it cannot use Mixed Hashed aggregation, which means that if the
grouping sets contain both non-hashable and non-sortable sets, it will
fall back to a one-phase aggregate.
To summarize the current state of parallel grouping sets, we now have
two available implementations for it.
1) Each worker performs an aggregation step, producing a partial result
for each group of which that process is aware. Then the partial results
are gathered to the leader, which then performs a grouping sets
aggregation, as in patch [1].
This implementation is not very efficient sometimes, because the group
key for Partial Aggregate has to be all the columns involved in the
grouping sets.
2) Each worker performs a grouping sets aggregation on its partial
data, and tags 'GroupingSetId' for each tuple produced by partial
aggregate. Then the partial results are gathered to the leader, and the
leader performs a modified grouping aggregate, which dispatches the
partial results into different pipes according to 'GroupingSetId', as in
patch [2]; or, as another method, the leader performs a normal
aggregation, with 'GroupingSetId' included in the group keys, as
discussed in [3].
The second implementation would be generally better than the first one
in performance, and we have decided to concentrate on it.
[1]: /messages/by-id/CAN_9JTx3NM12ZDzEYcOVLFiCBvwMHyM0gENvtTpKBoOOgcs=kw@mail.gmail.com
[2]: /messages/by-id/CAN_9JTwtTTnxhbr5AHuqVcriz3HxvPpx1JWE--DCSdJYuHrLtA@mail.gmail.com
[3]: /messages/by-id/CAN_9JTwtzttEmdXvMbJqXt=51kXiBTCKEPKq6kk2PZ6Xz6m5ig@mail.gmail.com
Thanks
Richard
Hi there,
We want to give an update on our work on parallel grouping sets. The
attached patchset implements parallel grouping sets with the strategy
proposed in
/messages/by-id/CAG4reARMcyn+X8gGRQEZyt32NoHc9MfznyPsg_C_V9G+dnQ15Q@mail.gmail.com
It contains some refinement of our code and adds LLVM support. It also
contains a few patches refactoring the grouping sets code to make the
parallel grouping sets implementation cleaner.
Like simple parallel aggregate, we separate the process of grouping sets
into two stages:
*The partial stage:*
the partial stage is much the same as the current grouping sets
implementation, the differences are:
- In the partial stage, like in regular parallel aggregation, only partial
aggregate results (e.g. transvalues) are produced.
- The output of the partial stage includes a grouping set ID to allow for
disambiguation during the final stage
The optimizations of the existing grouping sets implementation are
preserved during the partial stage, like:
- Grouping sets that can be combined in one rollup are still grouped
together (for group agg).
- Hashaggs can be performed concurrently with the first group agg.
- All hash transitions can be done in one expression state.
*The final stage*:
In the final stage, the partial aggregate results are combined according to
the grouping set id. None of the optimizations of the partial stage can be
leveraged in the final stage, so all rollups are extracted such that each
rollup contains only one grouping set, and each aggregate phase processes a
single grouping set. In this stage, tuples are multiplexed into the
different phases according to the grouping set id before we actually
aggregate them.
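The multiplexing in the final stage can be sketched roughly like this (toy Python model; the tuple layout and values are invented for illustration, not the patch's actual structures):

```python
from collections import Counter

# Gathered partial results, each tagged with a grouping set id:
# (gset_id, group_key, partial_count).
gathered = [
    (0, ('a1',), 3), (0, ('a2',), 2),    # set 0: GROUP BY a
    (1, ('b1',), 4), (0, ('a1',), 1),    # set 1: GROUP BY b
    (1, ('b1',), 1),
]

# Final stage: one aggregate phase per grouping set; each tuple is routed
# only to the phase matching its gset_id, so each phase sorts or hashes
# nothing but its own set's partials.
phases = [Counter(), Counter()]
for gset_id, key, n in gathered:
    phases[gset_id][key] += n            # combine partial counts
```

This is the sense in which the plan becomes a multiplexed pipe: the routing happens per tuple before any per-set combine work is done.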
An alternative approach to the final stage implementation that we
considered was using a single AGG with a grouping clause of gsetid plus
all the grouping columns. In the end, we decided against it because it
doesn't support mixed aggregation: firstly, if the grouping columns are a
mix of unsortable and unhashable columns, it cannot produce a path in the
final stage; secondly, mixed aggregation is the cheapest path in some
cases and this approach cannot support it. Meanwhile, if the union of all
the grouping columns is large, this parallel implementation will incur
undue costs.
The patches included in this patchset are as follows:
0001-All-grouping-sets-do-their-own-sorting.patch
This is a refactoring patch for the existing code. It moves the phase 0 SORT
into the AGG instead of assuming that the input is already sorted.
Postgres used to add a SORT path explicitly beneath the AGG for sort group
aggregate. Grouping sets path also adds a SORT path for the first sort
aggregate phase but the following sort aggregate phases do their own sorting
using a tuplesort. This commit unifies the way grouping sets paths do
sorting: all sort aggregate phases now do their own sorting using a
tuplesort.
We did this refactoring to support the final stage of parallel grouping
sets.
Adding a SORT path underneath the AGG in the final stage is wasteful. With
this patch, all non-hashed aggregate phases can do their own sorting after
the tuples are redirected.
Unpatched:
tpch=# explain (costs off) select count(*) from customer group by grouping
sets (c_custkey, c_name);
QUERY PLAN
----------------------------------
GroupAggregate
Group Key: c_custkey
Sort Key: c_name
Group Key: c_name
-> Sort
Sort Key: c_custkey
-> Seq Scan on customer
Patched:
tpch=# explain (costs off) select count(*) from customer group by grouping
sets (c_custkey, c_name);
QUERY PLAN
----------------------------
GroupAggregate
Sort Key: c_custkey
Group Key: c_custkey
Sort Key: c_name
Group Key: c_name
-> Seq Scan on customer
0002-fix-a-numtrans-bug.patch
Bugfix for the additional size of the hash table for hash aggregate:
the additional size was always zero.
/messages/by-id/CAG4reATfHUFVek4Hj6t2oDMqW=K02JBWLbURNSpftPhL5XrNRQ@mail.gmail.com
0003-Reorganise-the-aggregate-phases.patch
The planner used to organize the grouping sets in [HASHED]->[SORTED] order:
HASHED aggregates were always located before SORTED aggregates, and
ExecInitAgg() organized the aggregate phases in [HASHED]->[SORTED] order.
All HASHED grouping sets are squeezed into phase 0 when executing the
AGG node. For AGG_HASHED or AGG_MIXED strategies, however, the executor
will start from executing phase 1-3 assuming they are all groupaggs and then
return to phase 0 to execute hashaggs if it is AGG_MIXED.
When adding support for parallel grouping sets, this was a big barrier.
Firstly, we needed complicated logic in many places to locate the first
sort rollup/phase and handle the special order for a different strategy.
Secondly, squeezing all hashed grouping sets into phase 0 doesn't work for
the final stage: we can't put all transition functions into one expression
state there. ExecEvalExpr() is optimized to evaluate all the hashed
grouping sets for the same tuple; however, each input to the final stage is
a trans value, so we inherently should not evaluate more than one grouping
set for the same input.
This commit organizes the grouping sets in a more natural way:
[SORTED]->[HASHED].
The executor now starts execution from phase 0 for all strategies, and the
HASHED sets are no longer squeezed into a single phase. Instead, a HASHED
set has its own phase, and we use other ways to put all hash transitions in
one expression state for the partial stage.
This commit also moves 'sort_in' from the AggState to the AggStatePerPhase*
structure; this helps to handle the more complicated cases necessitated by
the introduction of parallel grouping sets. For example, we might then need
to add a tuplestore 'store_in' to store partial aggregate results for PLAIN
sets. It also gives us a chance to keep the first TupleSortState, so we do
not need to re-sort when rescanning.
0004-Parallel-grouping-sets.patch
This is the main logic. Patch 0001 and 0003 allow it to be pretty simple.
Here is an example plan with the patch applied:
tpch=# explain (costs off) select sum(l_quantity) as sum_qty, count(*) as
count_order from lineitem group by grouping sets((l_returnflag,
l_linestatus), (), l_suppkey);
QUERY PLAN
----------------------------------------------------
Finalize MixedAggregate
Filtered by: (GROUPINGSETID())
Sort Key: l_suppkey
Group Key: l_suppkey
Group Key: ()
Hash Key: l_returnflag, l_linestatus
-> Gather
Workers Planned: 7
-> Partial MixedAggregate
Sort Key: l_suppkey
Group Key: l_suppkey
Group Key: ()
Hash Key: l_returnflag, l_linestatus
-> Parallel Seq Scan on lineitem
(14 rows)
We have done some performance tests as well using a groupingsets-enhanced
subset of TPCH. TPCH didn't contain grouping sets queries, so we changed all
"group by" clauses to "group by rollup" clauses. We chose 14 queries for
the test. We noticed no performance regressions, and 3 queries showed
performance improvements due to parallelism (TPCH scale is 10 and
max_parallel_workers_per_gather is 8):
1.sql: 16150.780 ms vs 116093.550 ms
13.sql: 5288.635 ms vs 19541.981 ms
18.sql: 52985.084 ms vs 67980.856 ms
Thanks,
Pengzhou & Melanie & Jesse
Attachments:
0001-All-grouping-sets-do-their-own-sorting.patchapplication/x-patch; name=0001-All-grouping-sets-do-their-own-sorting.patchDownload
From f149cc81f093b266bd4c53a390d0c761d0415ac0 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:07:29 -0400
Subject: [PATCH 1/4] All grouping sets do their own sorting
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
PG used to add a SORT path explicitly beneath the AGG for sort aggregate,
grouping sets path also add a SORT path for the first sort aggregate phase,
but the following sort aggregate phases do their own sorting using a tuplesort.
This commit unifies the way grouping sets paths do sorting: all sort aggregate
phases now do their own sorting using a tuplesort.
This commit is mainly a preparatory step to support parallel grouping sets; the
main idea of parallel grouping sets is: like parallel aggregate, we separate
grouping sets into two stages:
The initial stage: this stage has almost the same plan and execution routines
as the current implementation of grouping sets; the differences are: 1) it
only produces partial aggregate results 2) the output is attached with an extra
grouping set id. We know partial aggregate results will be combined in the final
stage and we have multiple grouping sets, so only partial aggregate results
belonging to the same grouping set can be combined; that is why a grouping set id is
introduced to identify the sets. We keep all the optimizations of multiple
grouping sets in the initial stage, eg, 1) the grouping sets (that can be
grouped by one single sort) are put into one rollup structure so those sets
are computed in one aggregate phase. 2) do hash aggregate concurrently when a
sort aggregate is performed. 3) do all hash transitions in one expression state.
The final stage: this stage combine the partial aggregate results according to
the grouping set id. Obviously, all the optimizations in the initial stage
cannot be used, so all rollups are extracted, each rollup contains only one
grouping set, then each aggregate phase only processes one set. We do a filter
in the final stage to redirect the tuples to each aggregate phase.
Obviously, adding a SORT path underneath the AGG in the final stage is not
right. This commit can avoid it and all non-hashed aggregate phases can do
their own sorting after the tuples are redirected.
---
src/backend/commands/explain.c | 5 +-
src/backend/executor/nodeAgg.c | 79 +++++++++++++++---
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 65 +++++++++++----
src/backend/optimizer/plan/planner.c | 66 ++++++++++-----
src/backend/optimizer/util/pathnode.c | 30 ++++++-
src/include/executor/nodeAgg.h | 2 -
src/include/nodes/execnodes.h | 5 +-
src/include/nodes/pathnodes.h | 1 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/pathnode.h | 3 +-
src/include/optimizer/planmain.h | 2 +-
src/test/regress/expected/groupingsets.out | 130 +++++++++++++----------------
15 files changed, 260 insertions(+), 132 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..b1609b339a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2289,15 +2289,14 @@ show_grouping_sets(PlanState *planstate, Agg *agg,
ExplainOpenGroup("Grouping Sets", "Grouping Sets", false, es);
- show_grouping_set_keys(planstate, agg, NULL,
+ show_grouping_set_keys(planstate, agg, (Sort *) agg->sortnode,
context, useprefix, ancestors, es);
foreach(lc, agg->chain)
{
Agg *aggnode = lfirst(lc);
- Sort *sortnode = (Sort *) aggnode->plan.lefttree;
- show_grouping_set_keys(planstate, aggnode, sortnode,
+ show_grouping_set_keys(planstate, aggnode, (Sort *) aggnode->sortnode,
context, useprefix, ancestors, es);
}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7aebb247d8..b4f53bf77a 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -278,6 +278,7 @@ static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
+static void agg_sort_input(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
@@ -367,7 +368,7 @@ initialize_phase(AggState *aggstate, int newphase)
*/
if (newphase > 0 && newphase < aggstate->numphases - 1)
{
- Sort *sortnode = aggstate->phases[newphase + 1].sortnode;
+ Sort *sortnode = (Sort *)aggstate->phases[newphase + 1].aggnode->sortnode;
PlanState *outerNode = outerPlanState(aggstate);
TupleDesc tupDesc = ExecGetResultType(outerNode);
@@ -1594,6 +1595,8 @@ ExecAgg(PlanState *pstate)
break;
case AGG_PLAIN:
case AGG_SORTED:
+ if (!node->input_sorted)
+ agg_sort_input(node);
result = agg_retrieve_direct(node);
break;
}
@@ -1945,6 +1948,45 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+static void
+agg_sort_input(AggState *aggstate)
+{
+ AggStatePerPhase phase = &aggstate->phases[1];
+ TupleDesc tupDesc;
+ Sort *sortnode;
+
+ Assert(!aggstate->input_sorted);
+ Assert(phase->aggnode->sortnode);
+
+ sortnode = (Sort *) phase->aggnode->sortnode;
+ tupDesc = ExecGetResultType(outerPlanState(aggstate));
+
+ aggstate->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ for (;;)
+ {
+ TupleTableSlot *outerslot;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
+ if (TupIsNull(outerslot))
+ break;
+
+ tuplesort_puttupleslot(aggstate->sort_in, outerslot);
+ }
+
+ /* Sort the first phase */
+ tuplesort_performsort(aggstate->sort_in);
+
+ /* Mark the input to be sorted */
+ aggstate->input_sorted = true;
+}
+
/*
* ExecAgg for hashed case: read input and build hash table
*/
@@ -2127,6 +2169,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
Plan *outerPlan;
ExprContext *econtext;
TupleDesc scanDesc;
+ Agg *firstSortAgg;
int numaggs,
transno,
aggno;
@@ -2171,6 +2214,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->grp_firstTuple = NULL;
aggstate->sort_in = NULL;
aggstate->sort_out = NULL;
+ aggstate->input_sorted = true;
/*
* phases[0] always exists, but is dummy in sorted/plain mode
@@ -2178,6 +2222,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numPhases = (use_hashing ? 1 : 2);
numHashes = (use_hashing ? 1 : 0);
+ firstSortAgg = node->aggstrategy == AGG_SORTED ? node : NULL;
+
/*
* Calculate the maximum number of grouping sets in any phase; this
* determines the size of some allocations. Also calculate the number of
@@ -2199,7 +2245,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* others add an extra phase.
*/
if (agg->aggstrategy != AGG_HASHED)
+ {
++numPhases;
+
+ if (!firstSortAgg)
+ firstSortAgg = agg;
+
+ }
else
++numHashes;
}
@@ -2208,6 +2260,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->maxsets = numGroupingSets;
aggstate->numphases = numPhases;
+ /*
+ * The first SORTED phase is not sorted, agg need to do its own sort. See
+ * agg_sort_input(), this can only happen in groupingsets case.
+ */
+ if (firstSortAgg && firstSortAgg->sortnode)
+ aggstate->input_sorted = false;
+
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
@@ -2269,7 +2328,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* If there are more than two phases (including a potential dummy phase
* 0), input will be resorted using tuplesort. Need a slot for that.
*/
- if (numPhases > 2)
+ if (numPhases > 2 ||
+ !aggstate->input_sorted)
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -2340,20 +2400,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
for (phaseidx = 0; phaseidx <= list_length(node->chain); ++phaseidx)
{
Agg *aggnode;
- Sort *sortnode;
if (phaseidx > 0)
- {
aggnode = list_nth_node(Agg, node->chain, phaseidx - 1);
- sortnode = castNode(Sort, aggnode->plan.lefttree);
- }
else
- {
aggnode = node;
- sortnode = NULL;
- }
-
- Assert(phase <= 1 || sortnode);
if (aggnode->aggstrategy == AGG_HASHED
|| aggnode->aggstrategy == AGG_MIXED)
@@ -2470,7 +2521,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->aggnode = aggnode;
phasedata->aggstrategy = aggnode->aggstrategy;
- phasedata->sortnode = sortnode;
}
}
@@ -3559,6 +3609,10 @@ ExecReScanAgg(AggState *node)
sizeof(AggStatePerGroupData) * node->numaggs);
}
+ /* Reset input_sorted */
+ if (aggnode->sortnode)
+ node->input_sorted = false;
+
/* reset to phase 1 */
initialize_phase(node, 1);
@@ -3566,6 +3620,7 @@ ExecReScanAgg(AggState *node)
node->projected_set = -1;
}
+
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index e04c33e4ad..20ed43604e 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -992,6 +992,7 @@ _copyAgg(const Agg *from)
COPY_BITMAPSET_FIELD(aggParams);
COPY_NODE_FIELD(groupingSets);
COPY_NODE_FIELD(chain);
+ COPY_NODE_FIELD(sortnode);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..5816d122c1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -787,6 +787,7 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_BITMAPSET_FIELD(aggParams);
WRITE_NODE_FIELD(groupingSets);
WRITE_NODE_FIELD(chain);
+ WRITE_NODE_FIELD(sortnode);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..af4fcfe1ee 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2207,6 +2207,7 @@ _readAgg(void)
READ_BITMAPSET_FIELD(aggParams);
READ_NODE_FIELD(groupingSets);
READ_NODE_FIELD(chain);
+ READ_NODE_FIELD(sortnode);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..d5b34089aa 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1645,6 +1645,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
best_path->path.rows,
0,
+ NULL,
subplan);
}
else
@@ -2098,6 +2099,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
best_path->numGroups,
best_path->transitionSpace,
+ NULL,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2159,6 +2161,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
List *rollups = best_path->rollups;
AttrNumber *grouping_map;
int maxref;
+ int flags = CP_LABEL_TLIST;
List *chain;
ListCell *lc;
@@ -2168,9 +2171,15 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
/*
* Agg can project, so no need to be terribly picky about child tlist, but
- * we do need grouping columns to be available
+ * we do need grouping columns to be available. If the grouping sets need
+ * to sort the input, the agg will store the input rows in a tuplesort;
+ * it therefore behooves us to request a small tlist to avoid wasting
+ * space.
*/
- subplan = create_plan_recurse(root, best_path->subpath, CP_LABEL_TLIST);
+ if (!best_path->is_sorted)
+ flags = flags | CP_SMALL_TLIST;
+
+ subplan = create_plan_recurse(root, best_path->subpath, flags);
/*
* Compute the mapping from tleSortGroupRef to column index in the child's
@@ -2230,12 +2239,22 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
- if (!rollup->is_hashed && !is_first_sort)
+ if (!rollup->is_hashed)
{
- sort_plan = (Plan *)
- make_sort_from_groupcols(rollup->groupClause,
- new_grpColIdx,
- subplan);
+ if (!is_first_sort || !best_path->is_sorted)
+ {
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ new_grpColIdx,
+ subplan);
+
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
}
if (!rollup->is_hashed)
@@ -2260,16 +2279,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
NIL,
rollup->numGroups,
best_path->transitionSpace,
- sort_plan);
-
- /*
- * Remove stuff we don't need to avoid bloating debug output.
- */
- if (sort_plan)
- {
- sort_plan->targetlist = NIL;
- sort_plan->lefttree = NULL;
- }
+ sort_plan,
+ NULL);
chain = lappend(chain, agg_plan);
}
@@ -2281,10 +2292,26 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
{
RollupData *rollup = linitial(rollups);
AttrNumber *top_grpColIdx;
+ Plan *sort_plan = NULL;
int numGroupCols;
top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
+ /* the input is not sorted yet */
+ if (!rollup->is_hashed &&
+ !best_path->is_sorted)
+ {
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ top_grpColIdx,
+ subplan);
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
+
numGroupCols = list_length((List *) linitial(rollup->gsets));
plan = make_agg(build_path_tlist(root, &best_path->path),
@@ -2299,6 +2326,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
chain,
rollup->numGroups,
best_path->transitionSpace,
+ sort_plan,
subplan);
/* Copy cost data from Path to Plan */
@@ -6197,7 +6225,7 @@ make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain, double dNumGroups,
- Size transitionSpace, Plan *lefttree)
+ Size transitionSpace, Plan *sortnode, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6217,6 +6245,7 @@ make_agg(List *tlist, List *qual,
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
+ node->sortnode = sortnode;
plan->qual = qual;
plan->targetlist = tlist;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314..82a15761b4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -175,7 +175,8 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ AggStrategy strat);
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -4186,6 +4187,14 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
* it, by combinations of hashing and sorting. This can be called multiple
* times, so it's important that it not scribble on input. No result is
* returned, but any generated paths are added to grouped_rel.
+ *
+ * - strat:
+ * preferred aggregate strategy to use.
+ *
+ * - is_sorted:
+ * Whether the input is sorted on the groupCols of the first rollup.
+ * The caller must set this correctly when strat is AGG_SORTED; the
+ * planner uses it to decide whether to generate a sortnode.
*/
static void
consider_groupingsets_paths(PlannerInfo *root,
@@ -4195,13 +4204,15 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ AggStrategy strat)
{
Query *parse = root->parse;
+ Assert(strat == AGG_HASHED || strat == AGG_SORTED);
/*
- * If we're not being offered sorted input, then only consider plans that
- * can be done entirely by hashing.
+ * If strat is AGG_HASHED, then only consider plans that can be done
+ * entirely by hashing.
*
* We can hash everything if it looks like it'll fit in work_mem. But if
* the input is actually sorted despite not being advertised as such, we
@@ -4210,7 +4221,7 @@ consider_groupingsets_paths(PlannerInfo *root,
* If none of the grouping sets are sortable, then ignore the work_mem
* limit and generate a path anyway, since otherwise we'll just fail.
*/
- if (!is_sorted)
+ if (strat == AGG_HASHED)
{
List *new_rollups = NIL;
RollupData *unhashed_rollup = NULL;
@@ -4251,6 +4262,8 @@ consider_groupingsets_paths(PlannerInfo *root,
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
l_start = lnext(gd->rollups, l_start);
+ /* the input is already sorted for this rollup */
+ is_sorted = true;
}
hashsize = estimate_hashagg_tablesize(path,
@@ -4348,6 +4361,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->hashable = false;
rollup->is_hashed = false;
new_rollups = lappend(new_rollups, rollup);
+ /* empty grouping set, so no input ordering is needed */
+ is_sorted = true;
strat = AGG_MIXED;
}
@@ -4359,18 +4374,23 @@ consider_groupingsets_paths(PlannerInfo *root,
strat,
new_rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
return;
}
/*
- * If we have sorted input but nothing we can do with it, bail.
+ * The strategy is AGG_SORTED, but there is nothing we can do with it; bail.
*/
if (list_length(gd->rollups) == 0)
return;
/*
- * Given sorted input, we try and make two paths: one sorted and one mixed
+ * The caller requested the AGG_SORTED strategy, so the first rollup
+ * must use a non-hashed aggregate; 'is_sorted' tells whether the
+ * first rollup needs to do its own sort.
+ *
+ * We try to make two paths: one sorted and one mixed
* sort/hash. (We need to try both because hashagg might be disabled, or
* some columns might not be sortable.)
*
@@ -4427,7 +4447,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* We leave the first rollup out of consideration since it's the
- * one that matches the input sort order. We assign indexes "i"
+ * one that needs to be sorted. We assign indexes "i"
* to only those entries considered for hashing; the second loop,
* below, must use the same condition.
*/
@@ -4516,7 +4536,8 @@ consider_groupingsets_paths(PlannerInfo *root,
AGG_MIXED,
rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
}
}
@@ -4532,7 +4553,8 @@ consider_groupingsets_paths(PlannerInfo *root,
AGG_SORTED,
gd->rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
}
/*
@@ -6399,6 +6421,16 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
path->pathkeys);
if (path == cheapest_path || is_sorted)
{
+ if (parse->groupingSets)
+ {
+ /* consider AGG_SORTED strategy */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_costs, dNumGroups,
+ AGG_SORTED);
+ continue;
+ }
+
/* Sort the cheapest-total path if it isn't already sorted */
if (!is_sorted)
path = (Path *) create_sort_path(root,
@@ -6407,14 +6439,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
root->group_pathkeys,
-1.0);
- /* Now decide what to stick atop it */
- if (parse->groupingSets)
- {
- consider_groupingsets_paths(root, grouped_rel,
- path, true, can_hash,
- gd, agg_costs, dNumGroups);
- }
- else if (parse->hasAggs)
+ if (parse->hasAggs)
{
/*
* We have aggregation, possibly with plain GROUP BY. Make
@@ -6514,7 +6539,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ AGG_HASHED);
}
else
{
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce516211..0feb3363d3 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2983,6 +2983,7 @@ create_agg_path(PlannerInfo *root,
* 'rollups' is a list of RollupData nodes
* 'agg_costs' contains cost info about the aggregate functions to be computed
* 'numGroups' is the estimated total number of groups
+ * 'is_sorted' tells whether the input is sorted on the group cols of the first rollup
*/
GroupingSetsPath *
create_groupingsets_path(PlannerInfo *root,
@@ -2992,7 +2993,8 @@ create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups)
+ double numGroups,
+ bool is_sorted)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
PathTarget *target = rel->reltarget;
@@ -3010,6 +3012,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->is_sorted = is_sorted;
/*
* Simplify callers by downgrading AGG_SORTED to AGG_PLAIN, and AGG_MIXED
@@ -3061,14 +3064,33 @@ create_groupingsets_path(PlannerInfo *root,
*/
if (is_first)
{
+ Cost input_startup_cost = subpath->startup_cost;
+ Cost input_total_cost = subpath->total_cost;
+
+ if (!rollup->is_hashed && !is_sorted && numGroupCols)
+ {
+ Path sort_path; /* dummy for result of cost_sort */
+
+ cost_sort(&sort_path, root, NIL,
+ input_total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ -1.0);
+
+ input_startup_cost = sort_path.startup_cost;
+ input_total_cost = sort_path.total_cost;
+ }
+
cost_agg(&pathnode->path, root,
aggstrategy,
agg_costs,
numGroupCols,
rollup->numGroups,
having_qual,
- subpath->startup_cost,
- subpath->total_cost,
+ input_startup_cost,
+ input_total_cost,
subpath->rows);
is_first = false;
if (!rollup->is_hashed)
@@ -3079,7 +3101,7 @@ create_groupingsets_path(PlannerInfo *root,
Path sort_path; /* dummy for result of cost_sort */
Path agg_path; /* dummy for result of cost_agg */
- if (rollup->is_hashed || is_first_sort)
+ if (rollup->is_hashed || (is_first_sort && is_sorted))
{
/*
* Account for cost of aggregation, but don't charge input
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a9..66a83b9ac9 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -277,8 +277,6 @@ typedef struct AggStatePerPhaseData
ExprState **eqfunctions; /* expression returning equality, indexed by
* nr of cols to compare */
Agg *aggnode; /* Agg node for phase data */
- Sort *sortnode; /* Sort node for input ordering for phase */
-
ExprState *evaltrans; /* evaluation of transition functions */
} AggStatePerPhaseData;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..5e33a368f5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2083,8 +2083,11 @@ typedef struct AggState
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
+ /* these fields are used in AGG_SORTED and AGG_MIXED */
+ bool input_sorted; /* is the input sorted on the group cols of the first rollup? */
+
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 35
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..c1e69c808f 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1702,6 +1702,7 @@ typedef struct GroupingSetsPath
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
uint64 transitionSpace; /* for pass-by-ref transition data */
+ bool is_sorted; /* input sorted in groupcols of first rollup */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..3cd2537e9e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -818,6 +818,7 @@ typedef struct Agg
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Plan *sortnode; /* sort node if the agg does its own sort; currently used only by grouping sets */
} Agg;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..f9f388ba06 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,7 +217,8 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups);
+ double numGroups,
+ bool is_sorted);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
PathTarget *target,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 4781201001..5954ff3997 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain, double dNumGroups,
- Size transitionSpace, Plan *lefttree);
+ Size transitionSpace, Plan *sortnode, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a..12425f46ca 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -366,15 +366,14 @@ explain (costs off)
select g as alias1, g as alias2
from generate_series(1,3) g
group by alias1, rollup(alias2);
- QUERY PLAN
-------------------------------------------------
+ QUERY PLAN
+------------------------------------------
GroupAggregate
- Group Key: g, g
- Group Key: g
- -> Sort
- Sort Key: g
- -> Function Scan on generate_series g
-(6 rows)
+ Sort Key: g, g
+ Group Key: g, g
+ Group Key: g
+ -> Function Scan on generate_series g
+(5 rows)
select g as alias1, g as alias2
from generate_series(1,3) g
@@ -640,15 +639,14 @@ select a, b, sum(v.x)
-- Test reordering of grouping sets
explain (costs off)
select * from gstest1 group by grouping sets((a,b,v),(v)) order by v,b,a;
- QUERY PLAN
-------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------
GroupAggregate
- Group Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
- Group Key: "*VALUES*".column3
- -> Sort
- Sort Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
- -> Values Scan on "*VALUES*"
-(6 rows)
+ Sort Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
+ Group Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
+ Group Key: "*VALUES*".column3
+ -> Values Scan on "*VALUES*"
+(5 rows)
-- Agg level check. This query should error out.
select (select grouping(a,b) from gstest2) from gstest2 group by a,b;
@@ -723,13 +721,12 @@ explain (costs off)
QUERY PLAN
----------------------------------
GroupAggregate
- Group Key: a
- Group Key: ()
+ Sort Key: a
+ Group Key: a
+ Group Key: ()
Filter: (a IS DISTINCT FROM 1)
- -> Sort
- Sort Key: a
- -> Seq Scan on gstest2
-(7 rows)
+ -> Seq Scan on gstest2
+(6 rows)
select v.c, (select count(*) from gstest2 group by () having v.c)
from (values (false),(true)) v(c) order by v.c;
@@ -1018,18 +1015,17 @@ explain (costs off) select a, b, grouping(a,b), sum(v), count(*), max(v)
explain (costs off)
select a, b, grouping(a,b), array_agg(v order by v)
from gstest1 group by cube(a,b);
- QUERY PLAN
-----------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------
GroupAggregate
- Group Key: "*VALUES*".column1, "*VALUES*".column2
- Group Key: "*VALUES*".column1
- Group Key: ()
+ Sort Key: "*VALUES*".column1, "*VALUES*".column2
+ Group Key: "*VALUES*".column1, "*VALUES*".column2
+ Group Key: "*VALUES*".column1
+ Group Key: ()
Sort Key: "*VALUES*".column2
Group Key: "*VALUES*".column2
- -> Sort
- Sort Key: "*VALUES*".column1, "*VALUES*".column2
- -> Values Scan on "*VALUES*"
-(9 rows)
+ -> Values Scan on "*VALUES*"
+(8 rows)
-- unsortable cases
select unsortable_col, count(*)
@@ -1071,11 +1067,10 @@ explain (costs off)
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
Hash Key: unsortable_col
- Group Key: unhashable_col
- -> Sort
- Sort Key: unhashable_col
- -> Seq Scan on gstest4
-(8 rows)
+ Sort Key: unhashable_col
+ Group Key: unhashable_col
+ -> Seq Scan on gstest4
+(7 rows)
select unhashable_col, unsortable_col,
grouping(unhashable_col, unsortable_col),
@@ -1114,11 +1109,10 @@ explain (costs off)
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
Hash Key: v, unsortable_col
- Group Key: v, unhashable_col
- -> Sort
- Sort Key: v, unhashable_col
- -> Seq Scan on gstest4
-(8 rows)
+ Sort Key: v, unhashable_col
+ Group Key: v, unhashable_col
+ -> Seq Scan on gstest4
+(7 rows)
-- empty input: first is 0 rows, second 1, third 3 etc.
select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),a);
@@ -1366,19 +1360,18 @@ explain (costs off)
BEGIN;
SET LOCAL enable_hashagg = false;
EXPLAIN (COSTS OFF) SELECT a, b, count(*), max(a), max(b) FROM gstest3 GROUP BY GROUPING SETS(a, b,()) ORDER BY a, b;
- QUERY PLAN
----------------------------------------
+ QUERY PLAN
+---------------------------------
Sort
Sort Key: a, b
-> GroupAggregate
- Group Key: a
- Group Key: ()
+ Sort Key: a
+ Group Key: a
+ Group Key: ()
Sort Key: b
Group Key: b
- -> Sort
- Sort Key: a
- -> Seq Scan on gstest3
-(10 rows)
+ -> Seq Scan on gstest3
+(9 rows)
SELECT a, b, count(*), max(a), max(b) FROM gstest3 GROUP BY GROUPING SETS(a, b,()) ORDER BY a, b;
a | b | count | max | max
@@ -1549,22 +1542,21 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+----------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
- Group Key: unique1
+ Sort Key: unique1
+ Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
Sort Key: thousand
Group Key: thousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(13 rows)
+ -> Seq Scan on tenk1
+(12 rows)
explain (costs off)
select unique1,
@@ -1572,18 +1564,17 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+-------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
- Group Key: unique1
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(9 rows)
+ Sort Key: unique1
+ Group Key: unique1
+ -> Seq Scan on tenk1
+(8 rows)
set work_mem = '384kB';
explain (costs off)
@@ -1592,21 +1583,20 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+----------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
Hash Key: thousand
- Group Key: unique1
+ Sort Key: unique1
+ Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(12 rows)
+ -> Seq Scan on tenk1
+(11 rows)
-- check collation-sensitive matching between grouping expressions
-- (similar to a check for aggregates, but there are additional code
--
2.14.1
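The costing change in create_groupingsets_path above can be illustrated with a minimal sketch: when the first rollup is not hashed and the input is not presorted, the agg's own sort is charged before the aggregation cost. This is not the planner's actual cost model; the function name and constants below are hypothetical stand-ins for cost_sort()/cost_agg():

```python
import math

# Hypothetical sketch of charging the agg's own sort to the first rollup.
# 'sort_cost_per_row' is an illustrative constant, not a planner parameter.
def groupingsets_input_cost(startup, total, rows, is_hashed, is_sorted,
                            sort_cost_per_row=0.01):
    if not is_hashed and not is_sorted:
        # a sort consumes all of its input before emitting anything,
        # so the startup cost rises to the new total cost
        sort_cost = rows * math.log2(max(rows, 2)) * sort_cost_per_row
        total = total + sort_cost
        startup = total
    return startup, total

# presorted input: costs pass through unchanged
s1, t1 = groupingsets_input_cost(0.0, 100.0, 1000, is_hashed=False, is_sorted=True)
# unsorted input: the sort is charged and delays startup
s2, t2 = groupingsets_input_cost(0.0, 100.0, 1000, is_hashed=False, is_sorted=False)
assert (s1, t1) == (0.0, 100.0)
assert s2 == t2 and t2 > 100.0
```

This mirrors the hunk where cost_sort() is run on a dummy path and its startup/total costs are fed into cost_agg() in place of the subpath's own costs.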
Attachment: 0002-fix-a-numtrans-bug.patch
From 030754e8e7303b3044ed54a28d6f07fa2f56f5dc Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Thu, 12 Mar 2020 04:38:36 -0400
Subject: [PATCH 2/4] fix a numtrans bug
aggstate->numtrans is always zero when building the hash table for
hash aggregates, which makes the estimated additional size of the hash
table incorrect.
---
src/backend/executor/nodeAgg.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b4f53bf77a..cee51fe636 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2570,10 +2570,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
{
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
-
- find_hash_columns(aggstate);
- build_hash_tables(aggstate);
- aggstate->table_filled = false;
}
/*
@@ -2929,6 +2925,14 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
+ /* Initialize hash tables for hash aggregates */
+ if (use_hashing)
+ {
+ find_hash_columns(aggstate);
+ build_hash_tables(aggstate);
+ aggstate->table_filled = false;
+ }
+
/*
* Build expressions doing all the transition work at once. We build a
* different one for each phase, as the number of transition function
--
2.14.1
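The effect of the numtrans fix can be sketched as follows. The function and constants here are hypothetical stand-ins for the executor's per-entry size accounting: if the hash table is sized while numtrans is still zero, the estimate omits the per-group transition values entirely.

```python
# Illustrative sketch (not the real PostgreSQL sizing code): the per-entry
# footprint of the agg hash table includes one transition value per
# aggregate, so sizing before numtrans is set undercounts memory.
def hash_entry_size(tuple_width, num_trans, pergroup_size=16):
    # base grouping tuple plus one per-transition state per aggregate
    return tuple_width + num_trans * pergroup_size

too_early = hash_entry_size(tuple_width=32, num_trans=0)  # sized before init
correct = hash_entry_size(tuple_width=32, num_trans=2)    # sized after init
assert too_early < correct
```

Moving find_hash_columns()/build_hash_tables() to after the transition functions are set up, as the patch does, ensures the sizing sees the real aggregate count.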
Attachment: 0003-Reorganise-the-aggregate-phases.patch
From 1feb6543f424c576a7697026e65f73cb6e24b405 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:13:44 -0400
Subject: [PATCH 3/4] Reorganise the aggregate phases
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This commit is a preparatory step toward supporting parallel grouping sets.

When planning, PG used to organize the grouping sets in [HASHED] -> [SORTED]
order, which means HASHED aggregates were always located before SORTED
aggregates. When initializing the AGG node, PG also organized the aggregate
phases in [HASHED] -> [SORTED] order, with all HASHED grouping sets squeezed
into phase 0. When executing the AGG node under the AGG_SORTED or AGG_MIXED
strategy, the executor would start from phase 1 -> phase 2 -> phase 3, and
then run phase 0 for the AGG_MIXED strategy. This complicates adding support
for parallel grouping sets: first, we need complicated logic in many places to
locate the first sorted rollup/phase and to handle the special ordering for
each strategy; second, squeezing all hashed grouping sets into phase 0 does
not work for parallel grouping sets, because we cannot put all hash transition
functions into one expression state in the final stage.
This commit organizes the grouping sets in a more natural order,
[SORTED] -> [HASHED], and the HASHED sets are no longer squeezed into a single
phase; instead, we use another way to put all hash transitions into the first
phase's expression state. The executor now starts execution from phase 0 for
all strategies.

This commit also moves 'sort_in' from AggState to the AggStatePerPhase*
structures. This helps handle the more complicated cases that arise once
parallel grouping sets are introduced, when we may need to add a tuplestore
'store_in' to store partial aggregate results for PLAIN sets.
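The phase reordering described above can be sketched with a toy model. The data structures are simplified stand-ins for the executor's rollup/phase arrays, not the real code:

```python
# Toy model of the phase-ordering change: old layout squeezed all hashed
# sets into phase 0 and started execution at phase 1; the new layout gives
# each rollup its own phase, sorted first, and starts at phase 0.
def order_phases_old(rollups):
    hashed = [r for r in rollups if r["is_hashed"]]
    sorted_ = [r for r in rollups if not r["is_hashed"]]
    # all hashed grouping sets share phase 0; sorted rollups follow
    return [hashed] + [[r] for r in sorted_]

def order_phases_new(rollups):
    sorted_ = [r for r in rollups if not r["is_hashed"]]
    hashed = [r for r in rollups if r["is_hashed"]]
    # one phase per rollup, [SORTED] -> [HASHED]
    return [[r] for r in sorted_] + [[r] for r in hashed]

rollups = [{"name": "h1", "is_hashed": True},
           {"name": "s1", "is_hashed": False},
           {"name": "h2", "is_hashed": True}]

# old: phase 0 holds both hashed sets; execution starts at phase 1 (s1)
assert [r["name"] for r in order_phases_old(rollups)[0]] == ["h1", "h2"]
# new: sorted rollup comes first and execution starts at phase 0
assert [p[0]["name"] for p in order_phases_new(rollups)] == ["s1", "h1", "h2"]
```

With the new layout there is no special-case starting phase per strategy, which is what removes the "locate the first sort rollup" logic the message mentions.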
---
src/backend/commands/explain.c | 2 +-
src/backend/executor/execExpr.c | 58 +-
src/backend/executor/execExprInterp.c | 30 +-
src/backend/executor/nodeAgg.c | 718 +++++++++++-----------
src/backend/jit/llvm/llvmjit_expr.c | 51 +-
src/backend/optimizer/plan/createplan.c | 29 +-
src/backend/optimizer/plan/planner.c | 9 +-
src/backend/optimizer/util/pathnode.c | 71 +--
src/include/executor/execExpr.h | 5 +-
src/include/executor/executor.h | 2 +-
src/include/executor/nodeAgg.h | 26 +-
src/include/nodes/execnodes.h | 18 +-
src/test/regress/expected/groupingsets.out | 38 +-
src/test/regress/expected/partition_aggregate.out | 2 +-
14 files changed, 529 insertions(+), 530 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b1609b339a..2c63cdb46c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2317,7 +2317,7 @@ show_grouping_set_keys(PlanState *planstate,
const char *keyname;
const char *keysetname;
- if (aggnode->aggstrategy == AGG_HASHED || aggnode->aggstrategy == AGG_MIXED)
+ if (aggnode->aggstrategy == AGG_HASHED)
{
keyname = "Hash Key";
keysetname = "Hash Keys";
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 8c5ead93d6..de76f296b3 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -79,7 +79,7 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash,
+ int transno, int setno, AggStatePerPhase perphase,
bool nullcheck);
@@ -2930,13 +2930,13 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
* the array of AggStatePerGroup, and skip evaluation if so.
*/
ExprState *
-ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash, bool nullcheck)
+ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase, bool nullcheck)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
ExprEvalStep scratch = {0};
bool isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
+ ListCell *lc;
LastAttnumInfo deform = {0, 0, 0};
state->expr = (Expr *) aggstate;
@@ -3154,38 +3154,24 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* grouping set). Do so for both sort and hash based computations, as
* applicable.
*/
- if (doSort)
+ for (int setno = 0; setno < phase->numsets; setno++)
{
- int processGroupingSets = Max(phase->numsets, 1);
- int setoff = 0;
-
- for (int setno = 0; setno < processGroupingSets; setno++)
- {
- ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false,
- nullcheck);
- setoff++;
- }
+ ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
+ pertrans, transno, setno, phase, nullcheck);
}
- if (doHash)
+ /*
+ * Call transition function for HASHED aggs that can be
+ * advanced concurrently.
+ */
+ foreach(lc, phase->concurrent_hashes)
{
- int numHashes = aggstate->num_hashes;
- int setoff;
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) lfirst(lc);
- /* in MIXED mode, there'll be preceding transition values */
- if (aggstate->aggstrategy != AGG_HASHED)
- setoff = aggstate->maxsets;
- else
- setoff = 0;
-
- for (int setno = 0; setno < numHashes; setno++)
- {
- ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true,
- nullcheck);
- setoff++;
- }
+ ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
+ pertrans, transno, 0,
+ (AggStatePerPhase) perhash,
+ nullcheck);
}
/* adjust early bail out jump target(s) */
@@ -3233,14 +3219,17 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash,
+ int transno, int setno, AggStatePerPhase perphase,
bool nullcheck)
{
ExprContext *aggcontext;
int adjust_jumpnull = -1;
- if (ishash)
+ if (perphase->is_hashed)
+ {
+ Assert(setno == 0);
aggcontext = aggstate->hashcontext;
+ }
else
aggcontext = aggstate->aggcontexts[setno];
@@ -3248,9 +3237,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (nullcheck)
{
scratch->opcode = EEOP_AGG_PLAIN_PERGROUP_NULLCHECK;
- scratch->d.agg_plain_pergroup_nullcheck.setoff = setoff;
+ scratch->d.agg_plain_pergroup_nullcheck.pergroups = perphase->pergroups;
/* adjust later */
scratch->d.agg_plain_pergroup_nullcheck.jumpnull = -1;
+ scratch->d.agg_plain_pergroup_nullcheck.setno = setno;
ExprEvalPushStep(state, scratch);
adjust_jumpnull = state->steps_len - 1;
}
@@ -3318,7 +3308,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
scratch->d.agg_trans.pertrans = pertrans;
scratch->d.agg_trans.setno = setno;
- scratch->d.agg_trans.setoff = setoff;
+ scratch->d.agg_trans.pergroups = perphase->pergroups;
scratch->d.agg_trans.transno = transno;
scratch->d.agg_trans.aggcontext = aggcontext;
ExprEvalPushStep(state, scratch);
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 113ed1547c..b0dbba4e55 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -1610,9 +1610,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_CASE(EEOP_AGG_PLAIN_PERGROUP_NULLCHECK)
{
- AggState *aggstate = castNode(AggState, state->parent);
- AggStatePerGroup pergroup_allaggs = aggstate->all_pergroups
- [op->d.agg_plain_pergroup_nullcheck.setoff];
+ AggStatePerGroup pergroup_allaggs =
+ op->d.agg_plain_pergroup_nullcheck.pergroups
+ [op->d.agg_plain_pergroup_nullcheck.setno];
if (pergroup_allaggs == NULL)
EEO_JUMP(op->d.agg_plain_pergroup_nullcheck.jumpnull);
@@ -1636,8 +1636,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1665,8 +1665,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1684,8 +1684,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1702,8 +1702,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
@@ -1724,8 +1724,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
@@ -1742,8 +1742,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index cee51fe636..25e6eea822 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -227,6 +227,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
@@ -263,7 +264,7 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
-static void prepare_hash_slot(AggState *aggstate);
+static void prepare_hash_slot(AggState *aggstate, AggStatePerPhaseHash perhash);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
@@ -274,12 +275,13 @@ static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
-static void lookup_hash_entries(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, AggStatePerPhaseHash perhash);
+static void lookup_hash_entries(AggState *aggstate, List *perhashes);
+static void lookup_hash_entry(AggState *aggstate, AggStatePerPhaseHash perhash,
+ uint32 hash);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
-static void agg_sort_input(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static void agg_sort_input(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
@@ -310,7 +312,10 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
* ExecAggPlainTransByRef().
*/
if (is_hash)
+ {
+ Assert(setno == 0);
aggstate->curaggcontext = aggstate->hashcontext;
+ }
else
aggstate->curaggcontext = aggstate->aggcontexts[setno];
@@ -318,72 +323,75 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
}
/*
- * Switch to phase "newphase", which must either be 0 or 1 (to reset) or
+ * Switch to phase "newphase", which must either be 0 (to reset) or
* current_phase + 1. Juggle the tuplesorts accordingly.
- *
- * Phase 0 is for hashing, which we currently handle last in the AGG_MIXED
- * case, so when entering phase 0, all we need to do is drop open sorts.
*/
static void
initialize_phase(AggState *aggstate, int newphase)
{
- Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);
+ AggStatePerPhase current_phase;
+ AggStatePerPhaseSort persort;
+
+ Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
+
+ /* Don't use aggstate->phase here, it might not be initialized yet */
+ current_phase = aggstate->phases[aggstate->current_phase];
/*
* Whatever the previous state, we're now done with whatever input
- * tuplesort was in use.
+ * tuplesort was in use; clean it up.
+ *
+ * Note: we keep the first tuplesort/tuplestore; in some cases this
+ * benefits a rescan by avoiding a re-sort of the input.
*/
- if (aggstate->sort_in)
- {
- tuplesort_end(aggstate->sort_in);
- aggstate->sort_in = NULL;
- }
-
- if (newphase <= 1)
+ if (!current_phase->is_hashed && aggstate->current_phase > 0)
{
- /*
- * Discard any existing output tuplesort.
- */
- if (aggstate->sort_out)
+ persort = (AggStatePerPhaseSort) current_phase;
+ if (persort->sort_in)
{
- tuplesort_end(aggstate->sort_out);
- aggstate->sort_out = NULL;
+ tuplesort_end(persort->sort_in);
+ persort->sort_in = NULL;
}
}
- else
- {
- /*
- * The old output tuplesort becomes the new input one, and this is the
- * right time to actually sort it.
- */
- aggstate->sort_in = aggstate->sort_out;
- aggstate->sort_out = NULL;
- Assert(aggstate->sort_in);
- tuplesort_performsort(aggstate->sort_in);
- }
+
+ /* advance to next phase */
+ aggstate->current_phase = newphase;
+ aggstate->phase = aggstate->phases[newphase];
+
+ if (aggstate->phase->is_hashed)
+ return;
+
+ /* New phase is not hashed */
+ persort = (AggStatePerPhaseSort) aggstate->phase;
+
+ /* This is the right time to actually sort it. */
+ if (persort->sort_in)
+ tuplesort_performsort(persort->sort_in);
/*
- * If this isn't the last phase, we need to sort appropriately for the
+ * If copy_out is set, we need to sort appropriately for the
* next phase in sequence.
*/
- if (newphase > 0 && newphase < aggstate->numphases - 1)
- {
- Sort *sortnode = (Sort *)aggstate->phases[newphase + 1].aggnode->sortnode;
- PlanState *outerNode = outerPlanState(aggstate);
- TupleDesc tupDesc = ExecGetResultType(outerNode);
-
- aggstate->sort_out = tuplesort_begin_heap(tupDesc,
- sortnode->numCols,
- sortnode->sortColIdx,
- sortnode->sortOperators,
- sortnode->collations,
- sortnode->nullsFirst,
- work_mem,
- NULL, false);
+ if (persort->copy_out)
+ {
+ AggStatePerPhaseSort next =
+ (AggStatePerPhaseSort) aggstate->phases[newphase + 1];
+ Sort *sortnode = (Sort *) next->phasedata.aggnode->sortnode;
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+
+ Assert(!next->phasedata.is_hashed);
+
+ if (!next->sort_in)
+ next->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
}
-
- aggstate->current_phase = newphase;
- aggstate->phase = &aggstate->phases[newphase];
}
/*
@@ -398,12 +406,16 @@ static TupleTableSlot *
fetch_input_tuple(AggState *aggstate)
{
TupleTableSlot *slot;
+ AggStatePerPhaseSort current_phase;
+
+ Assert(!aggstate->phase->is_hashed);
+ current_phase = (AggStatePerPhaseSort) aggstate->phase;
- if (aggstate->sort_in)
+ if (current_phase->sort_in)
{
/* make sure we check for interrupts in either path through here */
CHECK_FOR_INTERRUPTS();
- if (!tuplesort_gettupleslot(aggstate->sort_in, true, false,
+ if (!tuplesort_gettupleslot(current_phase->sort_in, true, false,
aggstate->sort_slot, NULL))
return NULL;
slot = aggstate->sort_slot;
@@ -411,8 +423,13 @@ fetch_input_tuple(AggState *aggstate)
else
slot = ExecProcNode(outerPlanState(aggstate));
- if (!TupIsNull(slot) && aggstate->sort_out)
- tuplesort_puttupleslot(aggstate->sort_out, slot);
+ if (!TupIsNull(slot) && current_phase->copy_out)
+ {
+ AggStatePerPhaseSort next =
+ (AggStatePerPhaseSort) aggstate->phases[aggstate->current_phase + 1];
+ Assert(!next->phasedata.is_hashed);
+ tuplesort_puttupleslot(next->sort_in, slot);
+ }
return slot;
}
@@ -518,7 +535,7 @@ initialize_aggregates(AggState *aggstate,
int numReset)
{
int transno;
- int numGroupingSets = Max(aggstate->phase->numsets, 1);
+ int numGroupingSets = aggstate->phase->numsets;
int setno = 0;
int numTrans = aggstate->numtrans;
AggStatePerTrans transstates = aggstate->pertrans;
@@ -1046,10 +1063,9 @@ finalize_partialaggregate(AggState *aggstate,
* hashslot. This is necessary to compute the hash or perform a lookup.
*/
static void
-prepare_hash_slot(AggState *aggstate)
+prepare_hash_slot(AggState *aggstate, AggStatePerPhaseHash perhash)
{
TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -1283,18 +1299,22 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
static void
build_hash_tables(AggState *aggstate)
{
- int setno;
+ int phaseidx;
- for (setno = 0; setno < aggstate->num_hashes; ++setno)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
+ AggStatePerPhase phase;
+ AggStatePerPhaseHash perhash;
- Assert(perhash->aggnode->numGroups > 0);
+ phase = aggstate->phases[phaseidx];
+ if (!phase->is_hashed)
+ continue;
+ perhash = (AggStatePerPhaseHash) phase;
if (perhash->hashtable)
ResetTupleHashTable(perhash->hashtable);
else
- build_hash_table(aggstate, setno, perhash->aggnode->numGroups);
+ build_hash_table(aggstate, perhash);
}
}
@@ -1302,9 +1322,8 @@ build_hash_tables(AggState *aggstate)
* Build a single hashtable for this grouping set.
*/
static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, AggStatePerPhaseHash perhash)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
MemoryContext metacxt = aggstate->ss.ps.state->es_query_cxt;
MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory;
MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
@@ -1328,8 +1347,8 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
perhash->hashGrpColIdxHash,
perhash->eqfuncoids,
perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- nbuckets,
+ perhash->phasedata.aggnode->grpCollations,
+ perhash->phasedata.aggnode->numGroups,
additionalsize,
metacxt,
hashcxt,
@@ -1367,23 +1386,29 @@ find_hash_columns(AggState *aggstate)
{
Bitmapset *base_colnos;
List *outerTlist = outerPlanState(aggstate)->plan->targetlist;
- int numHashes = aggstate->num_hashes;
EState *estate = aggstate->ss.ps.state;
int j;
/* Find Vars that will be needed in tlist and qual */
base_colnos = find_unaggregated_cols(aggstate);
- for (j = 0; j < numHashes; ++j)
+ for (j = 0; j < aggstate->numphases; ++j)
{
- AggStatePerHash perhash = &aggstate->perhash[j];
+ AggStatePerPhase perphase = aggstate->phases[j];
+ AggStatePerPhaseHash perhash;
Bitmapset *colnos = bms_copy(base_colnos);
- AttrNumber *grpColIdx = perhash->aggnode->grpColIdx;
+ Bitmapset *grouped_cols = perphase->grouped_cols[0];
+ AttrNumber *grpColIdx = perphase->aggnode->grpColIdx;
List *hashTlist = NIL;
+ ListCell *lc;
TupleDesc hashDesc;
int maxCols;
int i;
+ if (!perphase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) perphase;
perhash->largestGrpColIdx = 0;
/*
@@ -1393,18 +1418,12 @@ find_hash_columns(AggState *aggstate)
* there'd be no point storing them. Use prepare_projection_slot's
* logic to determine which.
*/
- if (aggstate->phases[0].grouped_cols)
+ foreach(lc, aggstate->all_grouped_cols)
{
- Bitmapset *grouped_cols = aggstate->phases[0].grouped_cols[j];
- ListCell *lc;
-
- foreach(lc, aggstate->all_grouped_cols)
- {
- int attnum = lfirst_int(lc);
+ int attnum = lfirst_int(lc);
- if (!bms_is_member(attnum, grouped_cols))
- colnos = bms_del_member(colnos, attnum);
- }
+ if (!bms_is_member(attnum, grouped_cols))
+ colnos = bms_del_member(colnos, attnum);
}
/*
@@ -1460,7 +1479,7 @@ find_hash_columns(AggState *aggstate)
hashDesc = ExecTypeFromTL(hashTlist);
execTuplesHashPrepare(perhash->numCols,
- perhash->aggnode->grpOperators,
+ perphase->aggnode->grpOperators,
&perhash->eqfuncoids,
&perhash->hashfunctions);
perhash->hashslot =
@@ -1497,10 +1516,9 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
* When called, CurrentMemoryContext should be the per-query context. The
* already-calculated hash value for the tuple must be specified.
*/
-static AggStatePerGroup
-lookup_hash_entry(AggState *aggstate, uint32 hash)
+static void
+lookup_hash_entry(AggState *aggstate, AggStatePerPhaseHash perhash, uint32 hash)
{
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
bool isnew;
@@ -1532,7 +1550,7 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
}
}
- return entry->additional;
+ perhash->phasedata.pergroups[0] = entry->additional;
}
/*
@@ -1542,21 +1560,19 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
* Be aware that lookup_hash_entry can reset the tmpcontext.
*/
static void
-lookup_hash_entries(AggState *aggstate)
+lookup_hash_entries(AggState *aggstate, List *perhashes)
{
- int numHashes = aggstate->num_hashes;
- AggStatePerGroup *pergroup = aggstate->hash_pergroup;
- int setno;
+ ListCell *lc;
- for (setno = 0; setno < numHashes; setno++)
+ foreach (lc, perhashes)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
uint32 hash;
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) lfirst(lc);
- select_current_set(aggstate, setno, true);
- prepare_hash_slot(aggstate);
+ select_current_set(aggstate, 0, true);
+ prepare_hash_slot(aggstate, perhash);
hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
- pergroup[setno] = lookup_hash_entry(aggstate, hash);
+ lookup_hash_entry(aggstate, perhash, hash);
}
}
@@ -1589,12 +1605,11 @@ ExecAgg(PlanState *pstate)
case AGG_HASHED:
if (!node->table_filled)
agg_fill_hash_table(node);
- /* FALLTHROUGH */
- case AGG_MIXED:
result = agg_retrieve_hash_table(node);
break;
case AGG_PLAIN:
case AGG_SORTED:
+ case AGG_MIXED:
if (!node->input_sorted)
agg_sort_input(node);
result = agg_retrieve_direct(node);
@@ -1622,8 +1637,8 @@ agg_retrieve_direct(AggState *aggstate)
TupleTableSlot *outerslot;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- bool hasGroupingSets = aggstate->phase->numsets > 0;
- int numGroupingSets = Max(aggstate->phase->numsets, 1);
+ bool hasGroupingSets = aggstate->phase->aggnode->groupingSets != NULL;
+ int numGroupingSets = aggstate->phase->numsets;
int currentSet;
int nextSetSize;
int numReset;
@@ -1640,7 +1655,7 @@ agg_retrieve_direct(AggState *aggstate)
tmpcontext = aggstate->tmpcontext;
peragg = aggstate->peragg;
- pergroups = aggstate->pergroups;
+ pergroups = aggstate->phase->pergroups;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
/*
@@ -1698,25 +1713,32 @@ agg_retrieve_direct(AggState *aggstate)
{
if (aggstate->current_phase < aggstate->numphases - 1)
{
+ /* Advance to the next phase */
initialize_phase(aggstate, aggstate->current_phase + 1);
- aggstate->input_done = false;
- aggstate->projected_set = -1;
- numGroupingSets = Max(aggstate->phase->numsets, 1);
- node = aggstate->phase->aggnode;
- numReset = numGroupingSets;
- }
- else if (aggstate->aggstrategy == AGG_MIXED)
- {
- /*
- * Mixed mode; we've output all the grouped stuff and have
- * full hashtables, so switch to outputting those.
- */
- initialize_phase(aggstate, 0);
- aggstate->table_filled = true;
- ResetTupleHashIterator(aggstate->perhash[0].hashtable,
- &aggstate->perhash[0].hashiter);
- select_current_set(aggstate, 0, true);
- return agg_retrieve_hash_table(aggstate);
+
+ /* Check whether the new phase is an AGG_HASHED phase */
+ if (!aggstate->phase->is_hashed)
+ {
+ aggstate->input_done = false;
+ aggstate->projected_set = -1;
+ numGroupingSets = aggstate->phase->numsets;
+ node = aggstate->phase->aggnode;
+ numReset = numGroupingSets;
+ pergroups = aggstate->phase->pergroups;
+ }
+ else
+ {
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) aggstate->phase;
+
+ /*
+ * Mixed mode; we've output all the grouped stuff and have
+ * full hashtables, so switch to outputting those.
+ */
+ aggstate->table_filled = true;
+ ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
+ select_current_set(aggstate, 0, true);
+ return agg_retrieve_hash_table(aggstate);
+ }
}
else
{
@@ -1755,11 +1777,11 @@ agg_retrieve_direct(AggState *aggstate)
*/
tmpcontext->ecxt_innertuple = econtext->ecxt_outertuple;
if (aggstate->input_done ||
- (node->aggstrategy != AGG_PLAIN &&
+ (aggstate->phase->aggnode->numCols > 0 &&
aggstate->projected_set != -1 &&
aggstate->projected_set < (numGroupingSets - 1) &&
nextSetSize > 0 &&
- !ExecQualAndReset(aggstate->phase->eqfunctions[nextSetSize - 1],
+ !ExecQualAndReset(((AggStatePerPhaseSort) aggstate->phase)->eqfunctions[nextSetSize - 1],
tmpcontext)))
{
aggstate->projected_set += 1;
@@ -1862,13 +1884,13 @@ agg_retrieve_direct(AggState *aggstate)
for (;;)
{
/*
- * During phase 1 only of a mixed agg, we need to update
- * hashtables as well in advance_aggregates.
+ * If the current phase has concurrent hash phases, we need to
+ * update their hashtables as well in advance_aggregates.
*/
- if (aggstate->aggstrategy == AGG_MIXED &&
- aggstate->current_phase == 1)
+ if (aggstate->phase->concurrent_hashes)
{
- lookup_hash_entries(aggstate);
+ lookup_hash_entries(aggstate,
+ aggstate->phase->concurrent_hashes);
}
/* Advance the aggregates (or combine functions) */
@@ -1899,10 +1921,10 @@ agg_retrieve_direct(AggState *aggstate)
* If we are grouping, check whether we've crossed a group
* boundary.
*/
- if (node->aggstrategy != AGG_PLAIN)
+ if (aggstate->phase->aggnode->numCols > 0)
{
tmpcontext->ecxt_innertuple = firstSlot;
- if (!ExecQual(aggstate->phase->eqfunctions[node->numCols - 1],
+ if (!ExecQual(((AggStatePerPhaseSort) aggstate->phase)->eqfunctions[node->numCols - 1],
tmpcontext))
{
aggstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
@@ -1951,24 +1973,31 @@ agg_retrieve_direct(AggState *aggstate)
static void
agg_sort_input(AggState *aggstate)
{
- AggStatePerPhase phase = &aggstate->phases[1];
+ AggStatePerPhase phase = aggstate->phases[0];
+ AggStatePerPhaseSort persort = (AggStatePerPhaseSort) phase;
TupleDesc tupDesc;
Sort *sortnode;
+ bool randomAccess;
Assert(!aggstate->input_sorted);
+ Assert(!phase->is_hashed);
Assert(phase->aggnode->sortnode);
sortnode = (Sort *) phase->aggnode->sortnode;
tupDesc = ExecGetResultType(outerPlanState(aggstate));
-
- aggstate->sort_in = tuplesort_begin_heap(tupDesc,
- sortnode->numCols,
- sortnode->sortColIdx,
- sortnode->sortOperators,
- sortnode->collations,
- sortnode->nullsFirst,
- work_mem,
- NULL, false);
+ randomAccess = (aggstate->eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) != 0;
+
+
+ persort->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, randomAccess);
for (;;)
{
TupleTableSlot *outerslot;
@@ -1977,11 +2006,11 @@ agg_sort_input(AggState *aggstate)
if (TupIsNull(outerslot))
break;
- tuplesort_puttupleslot(aggstate->sort_in, outerslot);
+ tuplesort_puttupleslot(persort->sort_in, outerslot);
}
/* Sort the first phase */
- tuplesort_performsort(aggstate->sort_in);
+ tuplesort_performsort(persort->sort_in);
/* Mark the input to be sorted */
aggstate->input_sorted = true;
@@ -1993,8 +2022,14 @@ agg_sort_input(AggState *aggstate)
static void
agg_fill_hash_table(AggState *aggstate)
{
+ AggStatePerPhaseHash currentphase;
TupleTableSlot *outerslot;
ExprContext *tmpcontext = aggstate->tmpcontext;
+ List *concurrent_hashes = aggstate->phase->concurrent_hashes;
+
+ /* Current phase must be the first phase */
+ Assert(aggstate->current_phase == 0);
+ currentphase = (AggStatePerPhaseHash) aggstate->phase;
/*
* Process each outer-plan tuple, and then fetch the next one, until we
@@ -2002,15 +2037,25 @@ agg_fill_hash_table(AggState *aggstate)
*/
for (;;)
{
- outerslot = fetch_input_tuple(aggstate);
+ int hash;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
if (TupIsNull(outerslot))
break;
/* set up for lookup_hash_entries and advance_aggregates */
tmpcontext->ecxt_outertuple = outerslot;
- /* Find or build hashtable entries */
- lookup_hash_entries(aggstate);
+ /* Find hashtable entry of current phase */
+ select_current_set(aggstate, 0, true);
+ prepare_hash_slot(aggstate, currentphase);
+ hash = TupleHashTableHash(currentphase->hashtable, currentphase->hashslot);
+ lookup_hash_entry(aggstate, currentphase, hash);
+
+
+ /* Find or build hashtable entries of concurrent hash phases */
+ if (concurrent_hashes)
+ lookup_hash_entries(aggstate, concurrent_hashes);
/* Advance the aggregates (or combine functions) */
advance_aggregates(aggstate);
@@ -2025,8 +2070,7 @@ agg_fill_hash_table(AggState *aggstate)
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
- ResetTupleHashIterator(aggstate->perhash[0].hashtable,
- &aggstate->perhash[0].hashiter);
+ ResetTupleHashIterator(currentphase->hashtable, &currentphase->hashiter);
}
/*
@@ -2041,7 +2085,7 @@ agg_retrieve_hash_table(AggState *aggstate)
TupleHashEntryData *entry;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- AggStatePerHash perhash;
+ AggStatePerPhaseHash perhash;
/*
* get state info from node.
@@ -2052,11 +2096,7 @@ agg_retrieve_hash_table(AggState *aggstate)
peragg = aggstate->peragg;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
- /*
- * Note that perhash (and therefore anything accessed through it) can
- * change inside the loop, as we change between grouping sets.
- */
- perhash = &aggstate->perhash[aggstate->current_set];
+ perhash = (AggStatePerPhaseHash) aggstate->phase;
/*
* We loop retrieving groups until we find one satisfying
@@ -2075,18 +2115,15 @@ agg_retrieve_hash_table(AggState *aggstate)
entry = ScanTupleHashTable(perhash->hashtable, &perhash->hashiter);
if (entry == NULL)
{
- int nextset = aggstate->current_set + 1;
-
- if (nextset < aggstate->num_hashes)
+ if (aggstate->current_phase + 1 < aggstate->numphases)
{
/*
* Switch to next grouping set, reinitialize, and restart the
* loop.
*/
- select_current_set(aggstate, nextset, true);
-
- perhash = &aggstate->perhash[aggstate->current_set];
-
+ select_current_set(aggstate, 0, true);
+ initialize_phase(aggstate, aggstate->current_phase + 1);
+ perhash = (AggStatePerPhaseHash) aggstate->phase;
ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
continue;
@@ -2165,23 +2202,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
AggState *aggstate;
AggStatePerAgg peraggs;
AggStatePerTrans pertransstates;
- AggStatePerGroup *pergroups;
Plan *outerPlan;
ExprContext *econtext;
TupleDesc scanDesc;
- Agg *firstSortAgg;
int numaggs,
transno,
aggno;
- int phase;
int phaseidx;
ListCell *l;
Bitmapset *all_grouped_cols = NULL;
int numGroupingSets = 1;
- int numPhases;
- int numHashes;
int i = 0;
int j = 0;
+ bool need_extra_slot = false;
bool use_hashing = (node->aggstrategy == AGG_HASHED ||
node->aggstrategy == AGG_MIXED);
@@ -2210,24 +2243,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->curpertrans = NULL;
aggstate->input_done = false;
aggstate->agg_done = false;
- aggstate->pergroups = NULL;
aggstate->grp_firstTuple = NULL;
- aggstate->sort_in = NULL;
- aggstate->sort_out = NULL;
aggstate->input_sorted = true;
-
- /*
- * phases[0] always exists, but is dummy in sorted/plain mode
- */
- numPhases = (use_hashing ? 1 : 2);
- numHashes = (use_hashing ? 1 : 0);
-
- firstSortAgg = node->aggstrategy == AGG_SORTED ? node : NULL;
+ aggstate->eflags = eflags;
/*
* Calculate the maximum number of grouping sets in any phase; this
- * determines the size of some allocations. Also calculate the number of
- * phases, since all hashed/mixed nodes contribute to only a single phase.
+ * determines the size of some allocations.
*/
if (node->groupingSets)
{
@@ -2240,31 +2262,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numGroupingSets = Max(numGroupingSets,
list_length(agg->groupingSets));
- /*
- * additional AGG_HASHED aggs become part of phase 0, but all
- * others add an extra phase.
- */
if (agg->aggstrategy != AGG_HASHED)
- {
- ++numPhases;
-
- if (!firstSortAgg)
- firstSortAgg = agg;
-
- }
- else
- ++numHashes;
+ need_extra_slot = true;
}
}
aggstate->maxsets = numGroupingSets;
- aggstate->numphases = numPhases;
-
+ aggstate->numphases = 1 + list_length(node->chain);
+
/*
- * The first SORTED phase is not sorted, agg need to do its own sort. See
+ * The first phase is not sorted; the agg needs to do its own sort. See
* agg_sort_input(), this can only happen in groupingsets case.
*/
- if (firstSortAgg && firstSortAgg->sortnode)
+ if (node->sortnode)
aggstate->input_sorted = false;
aggstate->aggcontexts = (ExprContext **)
@@ -2325,11 +2335,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
scanDesc = aggstate->ss.ss_ScanTupleSlot->tts_tupleDescriptor;
/*
- * If there are more than two phases (including a potential dummy phase
- * 0), input will be resorted using tuplesort. Need a slot for that.
+ * An extra slot is needed if 1) the agg needs to do its own sort, or
+ * 2) the agg has more than one non-hashed phase.
*/
- if (numPhases > 2 ||
- !aggstate->input_sorted)
+ if (node->sortnode || need_extra_slot)
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -2381,76 +2390,94 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numaggs = aggstate->numaggs;
Assert(numaggs == list_length(aggstate->aggs));
- /*
+ /*
* For each phase, prepare grouping set data and fmgr lookup data for
* compare functions. Accumulate all_grouped_cols in passing.
*/
- aggstate->phases = palloc0(numPhases * sizeof(AggStatePerPhaseData));
-
- aggstate->num_hashes = numHashes;
- if (numHashes)
- {
- aggstate->perhash = palloc0(sizeof(AggStatePerHashData) * numHashes);
- aggstate->phases[0].numsets = 0;
- aggstate->phases[0].gset_lengths = palloc(numHashes * sizeof(int));
- aggstate->phases[0].grouped_cols = palloc(numHashes * sizeof(Bitmapset *));
- }
+ aggstate->phases = palloc0(aggstate->numphases * sizeof(AggStatePerPhase));
- phase = 0;
- for (phaseidx = 0; phaseidx <= list_length(node->chain); ++phaseidx)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
Agg *aggnode;
+ AggStatePerPhase phasedata = NULL;
if (phaseidx > 0)
aggnode = list_nth_node(Agg, node->chain, phaseidx - 1);
else
aggnode = node;
- if (aggnode->aggstrategy == AGG_HASHED
- || aggnode->aggstrategy == AGG_MIXED)
+ if (aggnode->aggstrategy == AGG_HASHED)
{
- AggStatePerPhase phasedata = &aggstate->phases[0];
- AggStatePerHash perhash;
- Bitmapset *cols = NULL;
-
- Assert(phase == 0);
- i = phasedata->numsets++;
- perhash = &aggstate->perhash[i];
+ AggStatePerPhaseHash perhash;
+ Bitmapset *cols = NULL;
- /* phase 0 always points to the "real" Agg in the hash case */
- phasedata->aggnode = node;
- phasedata->aggstrategy = node->aggstrategy;
+ perhash = (AggStatePerPhaseHash) palloc0(sizeof(AggStatePerPhaseHashData));
+ phasedata = (AggStatePerPhase) perhash;
+ phasedata->is_hashed = true;
+ phasedata->aggnode = aggnode;
+ phasedata->aggstrategy = aggnode->aggstrategy;
- /* but the actual Agg node representing this hash is saved here */
- perhash->aggnode = aggnode;
+ /* AGG_HASHED always has only one set */
+ phasedata->numsets = 1;
- phasedata->gset_lengths[i] = perhash->numCols = aggnode->numCols;
+ phasedata->gset_lengths = palloc(sizeof(int));
+ phasedata->gset_lengths[0] = perhash->numCols = aggnode->numCols;
+ phasedata->grouped_cols = palloc(sizeof(Bitmapset *));
for (j = 0; j < aggnode->numCols; ++j)
cols = bms_add_member(cols, aggnode->grpColIdx[j]);
-
- phasedata->grouped_cols[i] = cols;
+ phasedata->grouped_cols[0] = cols;
all_grouped_cols = bms_add_members(all_grouped_cols, cols);
- continue;
+
+ /*
+ * Initialize pergroup state. For AGG_HASHED, all groups advance
+ * their transitions on the fly and all pergroup states are kept in
+ * the hashtable. Every time a tuple is processed, lookup_hash_entry()
+ * chooses one group and sets phasedata->pergroups[0], which
+ * advance_aggregates then uses to do the transition for that group.
+ * We need not allocate a real pergroup state and set the pointer
+ * here; there are too many pergroup states, so lookup_hash_entry()
+ * allocates them on demand.
+ */
+ phasedata->pergroups =
+ (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup));
+
+ /*
+ * Hash aggregation does not depend on the order of the input
+ * tuples, so we can do the transition immediately when a tuple is
+ * fetched; that is, concurrently with the first phase.
+ */
+ if (phaseidx > 0)
+ {
+ aggstate->phases[0]->concurrent_hashes =
+ lappend(aggstate->phases[0]->concurrent_hashes, perhash);
+ /* this phase doesn't need to build its own transition functions */
+ phasedata->skip_build_trans = true;
+ }
}
else
{
- AggStatePerPhase phasedata = &aggstate->phases[++phase];
- int num_sets;
+ AggStatePerPhaseSort persort;
- phasedata->numsets = num_sets = list_length(aggnode->groupingSets);
+ persort = (AggStatePerPhaseSort) palloc0(sizeof(AggStatePerPhaseSortData));
+ phasedata = (AggStatePerPhase) persort;
+ phasedata->is_hashed = false;
+ phasedata->aggnode = aggnode;
+ phasedata->aggstrategy = aggnode->aggstrategy;
- if (num_sets)
+ if (aggnode->groupingSets)
{
- phasedata->gset_lengths = palloc(num_sets * sizeof(int));
- phasedata->grouped_cols = palloc(num_sets * sizeof(Bitmapset *));
+ phasedata->numsets = list_length(aggnode->groupingSets);
+ phasedata->gset_lengths = palloc(phasedata->numsets * sizeof(int));
+ phasedata->grouped_cols = palloc(phasedata->numsets * sizeof(Bitmapset *));
i = 0;
foreach(l, aggnode->groupingSets)
{
- int current_length = list_length(lfirst(l));
- Bitmapset *cols = NULL;
+ int current_length = list_length(lfirst(l));
+ Bitmapset *cols = NULL;
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -2467,37 +2494,49 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
else
{
- Assert(phaseidx == 0);
-
+ phasedata->numsets = 1;
phasedata->gset_lengths = NULL;
phasedata->grouped_cols = NULL;
}
+ /*
+ * Initialize pergroup states for AGG_SORTED/AGG_PLAIN/AGG_MIXED
+ * phases. Each set has only one group on the fly, so all groups in
+ * a set can reuse one pergroup state. Unlike AGG_HASHED, we
+ * pre-allocate the pergroup states here.
+ */
+ phasedata->pergroups =
+ (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup) * phasedata->numsets);
+
+ for (i = 0; i < phasedata->numsets; i++)
+ {
+ phasedata->pergroups[i] =
+ (AggStatePerGroup) palloc0(sizeof(AggStatePerGroupData) * numaggs);
+ }
+
/*
* If we are grouping, precompute fmgr lookup data for inner loop.
*/
- if (aggnode->aggstrategy == AGG_SORTED)
+ if (aggnode->numCols > 0)
{
int i = 0;
- Assert(aggnode->numCols > 0);
-
/*
* Build a separate function for each subset of columns that
* need to be compared.
*/
- phasedata->eqfunctions =
+ persort->eqfunctions =
(ExprState **) palloc0(aggnode->numCols * sizeof(ExprState *));
/* for each grouping set */
- for (i = 0; i < phasedata->numsets; i++)
+ for (i = 0; i < phasedata->numsets && phasedata->gset_lengths; i++)
{
int length = phasedata->gset_lengths[i];
- if (phasedata->eqfunctions[length - 1] != NULL)
+ if (persort->eqfunctions[length - 1] != NULL)
continue;
- phasedata->eqfunctions[length - 1] =
+ persort->eqfunctions[length - 1] =
execTuplesMatchPrepare(scanDesc,
length,
aggnode->grpColIdx,
@@ -2507,9 +2546,9 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
/* and for all grouped columns, unless already computed */
- if (phasedata->eqfunctions[aggnode->numCols - 1] == NULL)
+ if (persort->eqfunctions[aggnode->numCols - 1] == NULL)
{
- phasedata->eqfunctions[aggnode->numCols - 1] =
+ persort->eqfunctions[aggnode->numCols - 1] =
execTuplesMatchPrepare(scanDesc,
aggnode->numCols,
aggnode->grpColIdx,
@@ -2519,9 +2558,23 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
}
- phasedata->aggnode = aggnode;
- phasedata->aggstrategy = aggnode->aggstrategy;
+ /*
+ * A non-first AGG_SORTED phase processes the same input tuples as
+ * the previous phase, except that it needs to re-sort them. Tell
+ * the previous phase to copy out the tuples.
+ */
+ if (phaseidx > 0)
+ {
+ AggStatePerPhaseSort prev =
+ (AggStatePerPhaseSort) aggstate->phases[phaseidx - 1];
+
+ Assert(!prev->phasedata.is_hashed);
+ /* Tell the previous phase to copy the tuple to the sort_in */
+ prev->copy_out = true;
+ }
}
+
+ aggstate->phases[phaseidx] = phasedata;
}
/*
@@ -2545,51 +2598,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->peragg = peraggs;
aggstate->pertrans = pertransstates;
-
- aggstate->all_pergroups =
- (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup)
- * (numGroupingSets + numHashes));
- pergroups = aggstate->all_pergroups;
-
- if (node->aggstrategy != AGG_HASHED)
- {
- for (i = 0; i < numGroupingSets; i++)
- {
- pergroups[i] = (AggStatePerGroup) palloc0(sizeof(AggStatePerGroupData)
- * numaggs);
- }
-
- aggstate->pergroups = pergroups;
- pergroups += numGroupingSets;
- }
-
/*
- * Hashing can only appear in the initial phase.
+ * Initialize current phase-dependent values to initial phase.
*/
- if (use_hashing)
- {
- /* this is an array of pointers, not structures */
- aggstate->hash_pergroup = pergroups;
- }
-
- /*
- * Initialize current phase-dependent values to initial phase. The initial
- * phase is 1 (first sort pass) for all strategies that use sorting (if
- * hashing is being done too, then phase 0 is processed last); but if only
- * hashing is being done, then phase 0 is all there is.
- */
- if (node->aggstrategy == AGG_HASHED)
- {
- aggstate->current_phase = 0;
- initialize_phase(aggstate, 0);
- select_current_set(aggstate, 0, true);
- }
- else
- {
- aggstate->current_phase = 1;
- initialize_phase(aggstate, 1);
- select_current_set(aggstate, 0, false);
- }
+ aggstate->current_phase = 0;
+ initialize_phase(aggstate, 0);
+ select_current_set(aggstate, 0, aggstate->aggstrategy == AGG_HASHED);
/* -----------------
* Perform lookups of aggregate function info, and initialize the
@@ -2942,49 +2956,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- AggStatePerPhase phase = &aggstate->phases[phaseidx];
- bool dohash = false;
- bool dosort = false;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
- /* phase 0 doesn't necessarily exist */
- if (!phase->aggnode)
- continue;
-
- if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 1)
- {
- /*
- * Phase one, and only phase one, in a mixed agg performs both
- * sorting and aggregation.
- */
- dohash = true;
- dosort = true;
- }
- else if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 0)
- {
- /*
- * No need to compute a transition function for an AGG_MIXED phase
- * 0 - the contents of the hashtables will have been computed
- * during phase 1.
- */
+ if (phase->skip_build_trans)
continue;
- }
- else if (phase->aggstrategy == AGG_PLAIN ||
- phase->aggstrategy == AGG_SORTED)
- {
- dohash = false;
- dosort = true;
- }
- else if (phase->aggstrategy == AGG_HASHED)
- {
- dohash = true;
- dosort = false;
- }
- else
- Assert(false);
-
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
- false);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, false);
}
return aggstate;
@@ -3470,13 +3447,21 @@ ExecEndAgg(AggState *node)
int transno;
int numGroupingSets = Max(node->maxsets, 1);
int setno;
+ int phaseidx;
/* Make sure we have closed any open tuplesorts */
+ for (phaseidx = 0; phaseidx < node->numphases; phaseidx++)
+ {
AggStatePerPhase phase = node->phases[phaseidx];
+ AggStatePerPhaseSort persort;
- if (node->sort_in)
- tuplesort_end(node->sort_in);
- if (node->sort_out)
- tuplesort_end(node->sort_out);
+ if (phase->is_hashed)
+ continue;
+
+ persort = (AggStatePerPhaseSort) phase;
+ if (persort->sort_in)
+ tuplesort_end(persort->sort_in);
+ }
for (transno = 0; transno < node->numtrans; transno++)
{
@@ -3518,6 +3503,7 @@ ExecReScanAgg(AggState *node)
int transno;
int numGroupingSets = Max(node->maxsets, 1);
int setno;
+ int phaseidx;
node->agg_done = false;
@@ -3541,8 +3527,12 @@ ExecReScanAgg(AggState *node)
if (outerPlan->chgParam == NULL &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
- ResetTupleHashIterator(node->perhash[0].hashtable,
- &node->perhash[0].hashiter);
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) node->phases[0];
+ ResetTupleHashIterator(perhash->hashtable,
+ &perhash->hashiter);
+
+ /* reset to phase 0 */
+ initialize_phase(node, 0);
select_current_set(node, 0, true);
return;
}
@@ -3607,18 +3597,54 @@ ExecReScanAgg(AggState *node)
/*
* Reset the per-group state (in particular, mark transvalues null)
*/
- for (setno = 0; setno < numGroupingSets; setno++)
+ for (phaseidx = 0; phaseidx < node->numphases; phaseidx++)
{
- MemSet(node->pergroups[setno], 0,
- sizeof(AggStatePerGroupData) * node->numaggs);
+ AggStatePerPhase phase = node->phases[phaseidx];
+
+ /* hash pergroups is reset by build_hash_tables */
+ if (phase->is_hashed)
+ continue;
+
+ for (setno = 0; setno < phase->numsets; setno++)
+ MemSet(phase->pergroups[setno], 0,
+ sizeof(AggStatePerGroupData) * node->numaggs);
}
- /* Reset input_sorted */
+ /*
+ * The agg did its own first sort using a tuplesort, and that first
+ * tuplesort is kept (see initialize_phase). If the subplan does
+ * not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions,
+ * then we can just rescan the first tuplesort; there is no need
+ * to build it again.
+ *
+ * Note: the agg only does its own sort for grouping sets now.
+ */
if (aggnode->sortnode)
- node->input_sorted = false;
+ {
+ AggStatePerPhaseSort firstphase = (AggStatePerPhaseSort) node->phases[0];
+ bool randomAccess = (node->eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) != 0;
+ if (firstphase->sort_in &&
+ randomAccess &&
+ outerPlan->chgParam == NULL &&
+ !bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
+ {
+ tuplesort_rescan(firstphase->sort_in);
+ node->input_sorted = true;
+ }
+ else
+ {
+ if (firstphase->sort_in)
+ tuplesort_end(firstphase->sort_in);
+ firstphase->sort_in = NULL;
+ node->input_sorted = false;
+ }
+ }
- /* reset to phase 1 */
- initialize_phase(node, 1);
+ /* reset to phase 0 */
+ initialize_phase(node, 0);
node->input_done = false;
node->projected_set = -1;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index b855e73957..066cd59554 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2049,30 +2049,26 @@ llvm_compile_expr(ExprState *state)
case EEOP_AGG_PLAIN_PERGROUP_NULLCHECK:
{
int jumpnull;
- LLVMValueRef v_aggstatep;
- LLVMValueRef v_allpergroupsp;
+ LLVMValueRef v_pergroupsp;
LLVMValueRef v_pergroup_allaggs;
- LLVMValueRef v_setoff;
+ LLVMValueRef v_setno;
jumpnull = op->d.agg_plain_pergroup_nullcheck.jumpnull;
/*
- * pergroup_allaggs = aggstate->all_pergroups
- * [op->d.agg_plain_pergroup_nullcheck.setoff];
+ * pergroup =
+ * &op->d.agg_plain_pergroup_nullcheck.pergroups
+ * [op->d.agg_plain_pergroup_nullcheck.setno];
*/
- v_aggstatep = LLVMBuildBitCast(
- b, v_parent, l_ptr(StructAggState), "");
+ v_pergroupsp =
+ l_ptr_const(op->d.agg_plain_pergroup_nullcheck.pergroups,
+ l_ptr(l_ptr(StructAggStatePerGroupData)));
- v_allpergroupsp = l_load_struct_gep(
- b, v_aggstatep,
- FIELDNO_AGGSTATE_ALL_PERGROUPS,
- "aggstate.all_pergroups");
+ v_setno =
+ l_int32_const(op->d.agg_plain_pergroup_nullcheck.setno);
- v_setoff = l_int32_const(
- op->d.agg_plain_pergroup_nullcheck.setoff);
-
- v_pergroup_allaggs = l_load_gep1(
- b, v_allpergroupsp, v_setoff, "");
+ v_pergroup_allaggs =
+ l_load_gep1(b, v_pergroupsp, v_setno, "");
LLVMBuildCondBr(
b,
@@ -2094,6 +2090,7 @@ llvm_compile_expr(ExprState *state)
{
AggState *aggstate;
AggStatePerTrans pertrans;
+ AggStatePerGroup *pergroups;
FunctionCallInfo fcinfo;
LLVMValueRef v_aggstatep;
@@ -2103,12 +2100,12 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transvaluep;
LLVMValueRef v_transnullp;
- LLVMValueRef v_setoff;
+ LLVMValueRef v_setno;
LLVMValueRef v_transno;
LLVMValueRef v_aggcontext;
- LLVMValueRef v_allpergroupsp;
+ LLVMValueRef v_pergroupsp;
LLVMValueRef v_current_setp;
LLVMValueRef v_current_pertransp;
LLVMValueRef v_curaggcontext;
@@ -2124,6 +2121,7 @@ llvm_compile_expr(ExprState *state)
aggstate = castNode(AggState, state->parent);
pertrans = op->d.agg_trans.pertrans;
+ pergroups = op->d.agg_trans.pergroups;
fcinfo = pertrans->transfn_fcinfo;
@@ -2133,19 +2131,18 @@ llvm_compile_expr(ExprState *state)
l_ptr(StructAggStatePerTransData));
/*
- * pergroup = &aggstate->all_pergroups
- * [op->d.agg_strict_trans_check.setoff]
- * [op->d.agg_init_trans_check.transno];
+ * pergroup = &op->d.agg_trans.pergroups
+ * [op->d.agg_trans.setno]
+ * [op->d.agg_trans.transno];
*/
- v_allpergroupsp =
- l_load_struct_gep(b, v_aggstatep,
- FIELDNO_AGGSTATE_ALL_PERGROUPS,
- "aggstate.all_pergroups");
- v_setoff = l_int32_const(op->d.agg_trans.setoff);
+ v_pergroupsp =
+ l_ptr_const(pergroups,
+ l_ptr(l_ptr(StructAggStatePerGroupData)));
+ v_setno = l_int32_const(op->d.agg_trans.setno);
v_transno = l_int32_const(op->d.agg_trans.transno);
v_pergroupp =
LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
+ l_load_gep1(b, v_pergroupsp, v_setno, ""),
&v_transno, 1, "");
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index d5b34089aa..c33f0b134b 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2227,8 +2227,6 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
chain = NIL;
if (list_length(rollups) > 1)
{
- bool is_first_sort = ((RollupData *) linitial(rollups))->is_hashed;
-
for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst(lc);
@@ -2241,24 +2239,17 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
if (!rollup->is_hashed)
{
- if (!is_first_sort ||
- (is_first_sort && !best_path->is_sorted))
- {
- sort_plan = (Plan *)
- make_sort_from_groupcols(rollup->groupClause,
- new_grpColIdx,
- subplan);
-
- /*
- * Remove stuff we don't need to avoid bloating debug output.
- */
- sort_plan->targetlist = NIL;
- sort_plan->lefttree = NULL;
- }
- }
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ new_grpColIdx,
+ subplan);
- if (!rollup->is_hashed)
- is_first_sort = false;
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
if (rollup->is_hashed)
strat = AGG_HASHED;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 82a15761b4..28ae0644bd 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4347,7 +4347,8 @@ consider_groupingsets_paths(PlannerInfo *root,
if (unhashed_rollup)
{
- new_rollups = lappend(new_rollups, unhashed_rollup);
+ /* unhashed rollups always sit before hashed rollups */
+ new_rollups = lcons(unhashed_rollup, new_rollups);
strat = AGG_MIXED;
}
else if (empty_sets)
@@ -4360,7 +4361,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = list_length(empty_sets);
rollup->hashable = false;
rollup->is_hashed = false;
- new_rollups = lappend(new_rollups, rollup);
+ /* unhashed rollups always sit before hashed rollups */
+ new_rollups = lcons(rollup, new_rollups);
/* update is_sorted to true */
is_sorted = true;
strat = AGG_MIXED;
@@ -4523,7 +4525,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = gs->numGroups;
rollup->hashable = true;
rollup->is_hashed = true;
- rollups = lcons(rollup, rollups);
+ /* non-hashed rollups always sit before hashed rollups */
+ rollups = lappend(rollups, rollup);
}
if (rollups)
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0feb3363d3..2dfa3fa17e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2983,7 +2983,7 @@ create_agg_path(PlannerInfo *root,
* 'rollups' is a list of RollupData nodes
* 'agg_costs' contains cost info about the aggregate functions to be computed
* 'numGroups' is the estimated total number of groups
- * 'is_sorted' is the input sorted in the group cols of first rollup
+ * 'is_sorted' is true if the input is sorted on the group cols of the first rollup
*/
GroupingSetsPath *
create_groupingsets_path(PlannerInfo *root,
@@ -3000,7 +3000,6 @@ create_groupingsets_path(PlannerInfo *root,
PathTarget *target = rel->reltarget;
ListCell *lc;
bool is_first = true;
- bool is_first_sort = true;
/* The topmost generated Plan node will be an Agg */
pathnode->path.pathtype = T_Agg;
@@ -3053,14 +3052,13 @@ create_groupingsets_path(PlannerInfo *root,
int numGroupCols = list_length(linitial(gsets));
/*
- * In AGG_SORTED or AGG_PLAIN mode, the first rollup takes the
- * (already-sorted) input, and following ones do their own sort.
+ * In AGG_SORTED or AGG_PLAIN mode, the first rollup does its own
+ * sort if is_sorted is false; the following ones do their own sorts.
*
* In AGG_HASHED mode, there is one rollup for each grouping set.
*
- * In AGG_MIXED mode, the first rollups are hashed, the first
- * non-hashed one takes the (already-sorted) input, and following ones
- * do their own sort.
+ * In AGG_MIXED mode, the first rollup does its own sort if is_sorted
+ * is false; the following non-hashed ones do their own sorts.
*/
if (is_first)
{
@@ -3092,33 +3090,23 @@ create_groupingsets_path(PlannerInfo *root,
input_startup_cost,
input_total_cost,
subpath->rows);
+
is_first = false;
- if (!rollup->is_hashed)
- is_first_sort = false;
}
else
{
- Path sort_path; /* dummy for result of cost_sort */
- Path agg_path; /* dummy for result of cost_agg */
-
- if (rollup->is_hashed || (is_first_sort && is_sorted))
- {
- /*
- * Account for cost of aggregation, but don't charge input
- * cost again
- */
- cost_agg(&agg_path, root,
- rollup->is_hashed ? AGG_HASHED : AGG_SORTED,
- agg_costs,
- numGroupCols,
- rollup->numGroups,
- having_qual,
- 0.0, 0.0,
- subpath->rows);
- if (!rollup->is_hashed)
- is_first_sort = false;
- }
- else
+ AggStrategy rollup_strategy;
+ Path sort_path; /* dummy for result of cost_sort */
+ Path agg_path; /* dummy for result of cost_agg */
+
+ sort_path.startup_cost = 0;
+ sort_path.total_cost = 0;
+ sort_path.rows = subpath->rows;
+
+ rollup_strategy = rollup->is_hashed ?
+ AGG_HASHED : (numGroupCols ? AGG_SORTED : AGG_PLAIN);
+
+ if (!rollup->is_hashed && numGroupCols)
{
/* Account for cost of sort, but don't charge input cost again */
cost_sort(&sort_path, root, NIL,
@@ -3128,20 +3116,19 @@ create_groupingsets_path(PlannerInfo *root,
0.0,
work_mem,
-1.0);
-
- /* Account for cost of aggregation */
-
- cost_agg(&agg_path, root,
- AGG_SORTED,
- agg_costs,
- numGroupCols,
- rollup->numGroups,
- having_qual,
- sort_path.startup_cost,
- sort_path.total_cost,
- sort_path.rows);
}
+ /* Account for cost of aggregation */
+ cost_agg(&agg_path, root,
+ rollup_strategy,
+ agg_costs,
+ numGroupCols,
+ rollup->numGroups,
+ having_qual,
+ sort_path.startup_cost,
+ sort_path.total_cost,
+ sort_path.rows);
+
pathnode->path.total_cost += agg_path.total_cost;
pathnode->path.rows += agg_path.rows;
}
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index dbe8649a57..4ed5d0a7de 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -626,7 +626,8 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_PLAIN_PERGROUP_NULLCHECK */
struct
{
- int setoff;
+ AggStatePerGroup *pergroups;
+ int setno;
int jumpnull;
} agg_plain_pergroup_nullcheck;
@@ -634,11 +635,11 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
struct
{
+ AggStatePerGroup *pergroups;
AggStatePerTrans pertrans;
ExprContext *aggcontext;
int setno;
int transno;
- int setoff;
} agg_trans;
} d;
} ExprEvalStep;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 94890512dc..1f37f9236b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash, bool nullcheck);
+ bool nullcheck);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 66a83b9ac9..c5d4121c37 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -270,16 +270,29 @@ typedef struct AggStatePerGroupData
*/
typedef struct AggStatePerPhaseData
{
+ bool is_hashed; /* plan to do hash aggregate */
AggStrategy aggstrategy; /* strategy for this phase */
- int numsets; /* number of grouping sets (or 0) */
+ int numsets; /* number of grouping sets */
int *gset_lengths; /* lengths of grouping sets */
Bitmapset **grouped_cols; /* column groupings for rollup */
- ExprState **eqfunctions; /* expression returning equality, indexed by
- * nr of cols to compare */
Agg *aggnode; /* Agg node for phase data */
ExprState *evaltrans; /* evaluation of transition functions */
+
+ List *concurrent_hashes; /* hash phases that can do transitions concurrently */
+ AggStatePerGroup *pergroups; /* pergroup states for a phase */
+ bool skip_build_trans;
} AggStatePerPhaseData;
+typedef struct AggStatePerPhaseSortData
+{
+ AggStatePerPhaseData phasedata;
+ Tuplesortstate *sort_in; /* sorted input to this phase */
+ Tuplestorestate *store_in; /* stored input to this phase */
+ ExprState **eqfunctions; /* expression returning equality, indexed by
+ * nr of cols to compare */
+ bool copy_out; /* hint to copy input tuples out for the next phase */
+} AggStatePerPhaseSortData;
+
/*
* AggStatePerHashData - per-hashtable state
*
@@ -287,8 +300,9 @@ typedef struct AggStatePerPhaseData
* grouping set. (When doing hashing without grouping sets, we have just one of
* them.)
*/
-typedef struct AggStatePerHashData
+typedef struct AggStatePerPhaseHashData
{
+ AggStatePerPhaseData phasedata;
TupleHashTable hashtable; /* hash table with one entry per group */
TupleHashIterator hashiter; /* for iterating through hash table */
TupleTableSlot *hashslot; /* slot for loading hash table */
@@ -299,9 +313,7 @@ typedef struct AggStatePerHashData
int largestGrpColIdx; /* largest col required for hashing */
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
- Agg *aggnode; /* original Agg node, for numGroups etc. */
-} AggStatePerHashData;
-
+} AggStatePerPhaseHashData;
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5e33a368f5..4081a0978e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2036,7 +2036,8 @@ typedef struct AggStatePerAggData *AggStatePerAgg;
typedef struct AggStatePerTransData *AggStatePerTrans;
typedef struct AggStatePerGroupData *AggStatePerGroup;
typedef struct AggStatePerPhaseData *AggStatePerPhase;
-typedef struct AggStatePerHashData *AggStatePerHash;
+typedef struct AggStatePerPhaseSortData *AggStatePerPhaseSort;
+typedef struct AggStatePerPhaseHashData *AggStatePerPhaseHash;
typedef struct AggState
{
@@ -2068,28 +2069,19 @@ typedef struct AggState
List *all_grouped_cols; /* list of all grouped cols in DESC order */
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
- AggStatePerPhase phases; /* array of all phases */
+ AggStatePerPhase *phases; /* array of all phases */
Tuplesortstate *sort_in; /* sorted input to phases > 1 */
Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
- AggStatePerGroup *pergroups; /* grouping set indexed array of per-group
- * pointers */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
- /* these fields are used in AGG_HASHED and AGG_MIXED modes: */
+ /* these fields are used in AGG_HASHED */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
- AggStatePerHash perhash; /* array of per-hashtable data */
- AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
- * per-group pointers */
/* these fields are used in AGG_SORTED and AGG_MIXED */
bool input_sorted; /* is the input already sorted? */
+ int eflags; /* eflags for the first sort */
- /* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 35
- AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
- * ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
} AggState;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 12425f46ca..e7689ebd16 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1004,10 +1004,10 @@ explain (costs off) select a, b, grouping(a,b), sum(v), count(*), max(v)
Sort
Sort Key: (GROUPING("*VALUES*".column1, "*VALUES*".column2)), "*VALUES*".column1, "*VALUES*".column2
-> MixedAggregate
+ Group Key: ()
Hash Key: "*VALUES*".column1, "*VALUES*".column2
Hash Key: "*VALUES*".column1
Hash Key: "*VALUES*".column2
- Group Key: ()
-> Values Scan on "*VALUES*"
(8 rows)
@@ -1066,9 +1066,9 @@ explain (costs off)
Sort
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
- Hash Key: unsortable_col
Sort Key: unhashable_col
Group Key: unhashable_col
+ Hash Key: unsortable_col
-> Seq Scan on gstest4
(7 rows)
@@ -1108,9 +1108,9 @@ explain (costs off)
Sort
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
- Hash Key: v, unsortable_col
Sort Key: v, unhashable_col
Group Key: v, unhashable_col
+ Hash Key: v, unsortable_col
-> Seq Scan on gstest4
(7 rows)
@@ -1149,10 +1149,10 @@ explain (costs off)
QUERY PLAN
--------------------------------
MixedAggregate
- Hash Key: a, b
Group Key: ()
Group Key: ()
Group Key: ()
+ Hash Key: a, b
-> Seq Scan on gstest_empty
(6 rows)
@@ -1310,10 +1310,10 @@ explain (costs off)
-> Sort
Sort Key: a, b
-> MixedAggregate
+ Group Key: ()
Hash Key: a, b
Hash Key: a
Hash Key: b
- Group Key: ()
-> Seq Scan on gstest2
(11 rows)
@@ -1345,10 +1345,10 @@ explain (costs off)
Sort
Sort Key: gstest_data.a, gstest_data.b
-> MixedAggregate
+ Group Key: ()
Hash Key: gstest_data.a, gstest_data.b
Hash Key: gstest_data.a
Hash Key: gstest_data.b
- Group Key: ()
-> Nested Loop
-> Values Scan on "*VALUES*"
-> Function Scan on gstest_data
@@ -1545,16 +1545,16 @@ explain (costs off)
QUERY PLAN
----------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
Sort Key: unique1
Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
Sort Key: thousand
Group Key: thousand
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(12 rows)
@@ -1567,12 +1567,12 @@ explain (costs off)
QUERY PLAN
-------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
Sort Key: unique1
Group Key: unique1
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(8 rows)
@@ -1586,15 +1586,15 @@ explain (costs off)
QUERY PLAN
----------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
- Hash Key: thousand
Sort Key: unique1
Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
+ Hash Key: thousand
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(11 rows)
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..7818f02032 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -340,8 +340,8 @@ SELECT c, sum(a) FROM pagg_tab GROUP BY rollup(c) ORDER BY 1, 2;
Sort
Sort Key: pagg_tab.c, (sum(pagg_tab.a))
-> MixedAggregate
- Hash Key: pagg_tab.c
Group Key: ()
+ Hash Key: pagg_tab.c
-> Append
-> Seq Scan on pagg_tab_p1 pagg_tab_1
-> Seq Scan on pagg_tab_p2 pagg_tab_2
--
2.14.1
Attachment: 0004-Parallel-grouping-sets.patch
From a0c9b7f93201dabfd030f48de0da5875cf0923f4 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:08:11 -0400
Subject: [PATCH 4/4] Parallel grouping sets
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We used to support grouping sets in a single worker only; this patch
adds support for parallel grouping sets using multiple workers.

The main idea of parallel grouping sets is that, like parallel
aggregation, we separate grouping sets into two stages:

The initial stage: this stage has almost the same plan and execution
routines as the current implementation of grouping sets. The
differences are that 1) it only produces partial aggregate results,
and 2) the output carries an extra grouping set ID. Since the partial
aggregate results will be combined in the final stage and there are
multiple grouping sets, only partial results belonging to the same
grouping set can be combined; that is why the grouping set ID is
introduced to identify the sets. We keep all the optimizations of
multiple grouping sets in the initial stage, e.g., 1) grouping sets
that can be grouped by one single sort are put into one rollup
structure, so those sets are computed in one aggregate phase; 2) hash
aggregation is performed concurrently while a sort aggregation runs;
3) all hash transitions are done in one expression state.

The final stage: this stage combines the partial aggregate results
according to the grouping set ID. Obviously, the optimizations of the
initial stage cannot be used here, so all rollups are extracted such
that each rollup contains only one grouping set; each aggregate phase
then processes only one set. The final stage applies a filter that
redirects tuples to the appropriate aggregate phase.
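As a rough illustration of the two-stage split, here is a toy Python model
(not the executor code; count() is used as the aggregate and all names are
made up): each worker tags its partial results with a grouping set ID, and
the final stage combines partials only within the same set ID.

```python
from collections import defaultdict

# Like GROUP BY GROUPING SETS ((c1, c2), (c1), (c2))
GROUPING_SETS = [("c1", "c2"), ("c1",), ("c2",)]

def initial_stage(rows):
    """One worker: produce partial counts keyed by (grouping set id, group key)."""
    partial = defaultdict(int)
    for row in rows:
        for setid, cols in enumerate(GROUPING_SETS):
            key = tuple(row[c] for c in cols)
            partial[(setid, key)] += 1
    return partial

def final_stage(partials):
    """Leader: combine partial results, but only within the same set id."""
    combined = defaultdict(int)
    for partial in partials:
        for (setid, key), n in partial.items():
            combined[(setid, key)] += n
    return dict(combined)

rows = [{"c1": 1, "c2": "a"}, {"c1": 1, "c2": "b"}, {"c1": 2, "c2": "a"}]
# Two "workers" each aggregate a slice of the input; the leader merges per set id.
result = final_stage([initial_stage(rows[:2]), initial_stage(rows[2:])])
```

The parallel result matches what a single serial pass over all rows would
produce, which is the correctness property the set ID exists to guarantee.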
---
src/backend/commands/explain.c | 10 +-
src/backend/executor/execExpr.c | 10 +-
src/backend/executor/execExprInterp.c | 11 +
src/backend/executor/nodeAgg.c | 261 ++++++++++++++++++++++--
src/backend/jit/llvm/llvmjit_expr.c | 40 ++++
src/backend/nodes/copyfuncs.c | 56 +++++-
src/backend/nodes/equalfuncs.c | 3 +
src/backend/nodes/nodeFuncs.c | 8 +
src/backend/nodes/outfuncs.c | 14 +-
src/backend/nodes/readfuncs.c | 53 ++++-
src/backend/optimizer/path/allpaths.c | 5 +-
src/backend/optimizer/plan/createplan.c | 25 +--
src/backend/optimizer/plan/planner.c | 344 ++++++++++++++++++++++++--------
src/backend/optimizer/plan/setrefs.c | 16 ++
src/backend/optimizer/util/pathnode.c | 27 ++-
src/backend/utils/adt/ruleutils.c | 6 +
src/include/executor/execExpr.h | 1 +
src/include/executor/nodeAgg.h | 2 +
src/include/nodes/execnodes.h | 8 +-
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 2 +
src/include/nodes/plannodes.h | 4 +-
src/include/nodes/primnodes.h | 6 +
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/planmain.h | 2 +-
25 files changed, 790 insertions(+), 126 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 2c63cdb46c..8b6877c41e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2256,12 +2256,16 @@ show_agg_keys(AggState *astate, List *ancestors,
{
Agg *plan = (Agg *) astate->ss.ps.plan;
- if (plan->numCols > 0 || plan->groupingSets)
+ if (plan->gsetid)
+ show_expression((Node *) plan->gsetid, "Filtered by",
+ (PlanState *) astate, ancestors, true, es);
+
+ if (plan->numCols > 0 || plan->rollup)
{
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(plan, ancestors);
- if (plan->groupingSets)
+ if (plan->rollup)
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
@@ -2312,7 +2316,7 @@ show_grouping_set_keys(PlanState *planstate,
Plan *plan = planstate->plan;
char *exprstr;
ListCell *lc;
- List *gsets = aggnode->groupingSets;
+ List *gsets = aggnode->rollup->gsets;
AttrNumber *keycols = aggnode->grpColIdx;
const char *keyname;
const char *keysetname;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index de76f296b3..cb809bb742 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -814,7 +814,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
agg = (Agg *) (state->parent->plan);
- if (agg->groupingSets)
+ if (agg->rollup)
scratch.d.grouping_func.clauses = grp_node->cols;
else
scratch.d.grouping_func.clauses = NIL;
@@ -823,6 +823,14 @@ ExecInitExprRec(Expr *node, ExprState *state,
break;
}
+ case T_GroupingSetId:
+ {
+ scratch.opcode = EEOP_GROUPING_SET_ID;
+
+ ExprEvalPushStep(state, &scratch);
+ break;
+ }
+
case T_WindowFunc:
{
WindowFunc *wfunc = (WindowFunc *) node;
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index b0dbba4e55..b3537eb8d9 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -428,6 +428,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_XMLEXPR,
&&CASE_EEOP_AGGREF,
&&CASE_EEOP_GROUPING_FUNC,
+ &&CASE_EEOP_GROUPING_SET_ID,
&&CASE_EEOP_WINDOW_FUNC,
&&CASE_EEOP_SUBPLAN,
&&CASE_EEOP_ALTERNATIVE_SUBPLAN,
@@ -1512,6 +1513,16 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_GROUPING_SET_ID)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+
+ *op->resvalue = aggstate->phase->setno_gsetids[aggstate->current_set];
+ *op->resnull = false;
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_WINDOW_FUNC)
{
/*
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 25e6eea822..89b3f50a06 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -282,6 +282,7 @@ static void lookup_hash_entry(AggState *aggstate, AggStatePerPhaseHash perhash,
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static void agg_sort_input(AggState *aggstate);
+static void agg_preprocess_groupingsets(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
@@ -341,17 +342,26 @@ initialize_phase(AggState *aggstate, int newphase)
* Whatever the previous state, we're now done with whatever input
* tuplesort was in use, cleanup them.
*
- * Note: we keep the first tuplesort/tuplestore, this will benifit the
+ * Note: we keep the first tuplesort/tuplestore when it's not the
+ * final stage of partial grouping sets; this benefits rescan
+ * in some cases by avoiding re-sorting the input.
*/
- if (!current_phase->is_hashed && aggstate->current_phase > 0)
+ if (!current_phase->is_hashed &&
+ (aggstate->current_phase > 0 || DO_AGGSPLIT_COMBINE(aggstate->aggsplit)))
{
persort = (AggStatePerPhaseSort) current_phase;
+
if (persort->sort_in)
{
tuplesort_end(persort->sort_in);
persort->sort_in = NULL;
}
+
+ if (persort->store_in)
+ {
+ tuplestore_end(persort->store_in);
+ persort->store_in = NULL;
+ }
}
/* advance to next phase */
@@ -420,6 +430,15 @@ fetch_input_tuple(AggState *aggstate)
return NULL;
slot = aggstate->sort_slot;
}
+ else if (current_phase->store_in)
+ {
+ /* make sure we check for interrupts in either path through here */
+ CHECK_FOR_INTERRUPTS();
+ if (!tuplestore_gettupleslot(current_phase->store_in, true, false,
+ aggstate->sort_slot))
+ return NULL;
+ slot = aggstate->sort_slot;
+ }
else
slot = ExecProcNode(outerPlanState(aggstate));
@@ -1597,6 +1616,9 @@ ExecAgg(PlanState *pstate)
CHECK_FOR_INTERRUPTS();
+ if (node->groupingsets_preprocess)
+ agg_preprocess_groupingsets(node);
+
if (!node->agg_done)
{
/* Dispatch based on strategy */
@@ -1637,7 +1659,7 @@ agg_retrieve_direct(AggState *aggstate)
TupleTableSlot *outerslot;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- bool hasGroupingSets = aggstate->phase->aggnode->groupingSets != NULL;
+ bool hasGroupingSets = aggstate->phase->aggnode->rollup != NULL;
int numGroupingSets = aggstate->phase->numsets;
int currentSet;
int nextSetSize;
@@ -1970,6 +1992,135 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+/*
+ * Routine for final phase of partial grouping sets:
+ *
+ * Preprocess tuples for the final phase of grouping sets. In the initial
+ * phase, each tuple is decorated with a grouping set ID; in the final
+ * phase, each grouping set is handled by its own aggregate phase, so we
+ * must redirect each tuple to the aggregate phase matching its grouping
+ * set ID.
+ */
+static void
+agg_preprocess_groupingsets(AggState *aggstate)
+{
+ AggStatePerPhaseSort persort;
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase;
+ TupleTableSlot *outerslot;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ int phaseidx;
+
+ Assert(DO_AGGSPLIT_COMBINE(aggstate->aggsplit));
+ Assert(aggstate->groupingsets_preprocess);
+
+ /* Initialize tuple storage for each aggregate phase */
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
+ {
+ phase = aggstate->phases[phaseidx];
+
+ if (!phase->is_hashed)
+ {
+ persort = (AggStatePerPhaseSort) phase;
+ if (phase->aggnode->sortnode)
+ {
+ Sort *sortnode = (Sort *) phase->aggnode->sortnode;
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+
+ persort->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ }
+ else
+ {
+ persort->store_in = tuplestore_begin_heap(false, false, work_mem);
+ }
+ }
+ else
+ {
+ /*
+ * If it's AGG_HASHED, we don't need storage to keep the
+ * tuples for later processing; we can do the transition
+ * immediately.
+ */
+ }
+ }
+
+ for (;;)
+ {
+ Datum ret;
+ bool isNull;
+ int setid;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
+ if (TupIsNull(outerslot))
+ break;
+
+ tmpcontext->ecxt_outertuple = outerslot;
+
+ /* Figure out which grouping set the tuple belongs to */
+ ret = ExecEvalExprSwitchContext(aggstate->gsetid, tmpcontext, &isNull);
+
+ setid = DatumGetInt32(ret);
+ phase = aggstate->phases[aggstate->gsetid_phaseidxs[setid]];
+
+ if (!phase->is_hashed)
+ {
+ persort = (AggStatePerPhaseSort) phase;
+
+ Assert(persort->sort_in || persort->store_in);
+
+ if (persort->sort_in)
+ tuplesort_puttupleslot(persort->sort_in, outerslot);
+ else if (persort->store_in)
+ tuplestore_puttupleslot(persort->store_in, outerslot);
+ }
+ else
+ {
+ int hash;
+ bool dummynull;
+
+ perhash = (AggStatePerPhaseHash) phase;
+
+ /* If it is hashed, we can do the transition now. */
+ select_current_set(aggstate, 0, true);
+ prepare_hash_slot(aggstate, perhash);
+ hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
+ lookup_hash_entry(aggstate, perhash, hash);
+
+ ExecEvalExprSwitchContext(phase->evaltrans,
+ tmpcontext,
+ &dummynull);
+ }
+
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ /* Sort the first phase if needed */
+ if (aggstate->aggstrategy != AGG_HASHED)
+ {
+ persort = (AggStatePerPhaseSort) aggstate->phase;
+
+ if (persort->sort_in)
+ tuplesort_performsort(persort->sort_in);
+ }
+
+ /* Mark the hash table to be filled */
+ aggstate->table_filled = true;
+
+ /* Mark the input as already sorted */
+ aggstate->input_sorted = true;
+
+ /* Clear the flag so we don't preprocess grouping sets again */
+ aggstate->groupingsets_preprocess = false;
+}
+
static void
agg_sort_input(AggState *aggstate)
{
@@ -2246,21 +2397,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->grp_firstTuple = NULL;
aggstate->input_sorted = true;
aggstate->eflags = eflags;
+ aggstate->groupingsets_preprocess = false;
/*
* Calculate the maximum number of grouping sets in any phase; this
* determines the size of some allocations.
*/
- if (node->groupingSets)
+ if (node->rollup)
{
- numGroupingSets = list_length(node->groupingSets);
+ numGroupingSets = list_length(node->rollup->gsets);
foreach(l, node->chain)
{
Agg *agg = lfirst(l);
numGroupingSets = Max(numGroupingSets,
- list_length(agg->groupingSets));
+ list_length(agg->rollup->gsets));
if (agg->aggstrategy != AGG_HASHED)
need_extra_slot = true;
@@ -2270,6 +2422,28 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->maxsets = numGroupingSets;
aggstate->numphases = 1 + list_length(node->chain);
+ /*
+ * If we are doing the final stage of partial grouping sets, preprocess
+ * the input tuples first, redirecting them to the appropriate aggregate
+ * phases. See agg_preprocess_groupingsets().
+ */
+ if (node->rollup && DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ aggstate->groupingsets_preprocess = true;
+
+ /*
+ * Allocate the gsetid <-> phase mapping. In the final stage of
+ * partial grouping sets, each grouping set is extracted into an
+ * individual phase, so the number of sets equals the number of
+ * phases.
+ */
+ aggstate->gsetid_phaseidxs =
+ (int *) palloc0(aggstate->numphases * sizeof(int));
+
+ if (aggstate->aggstrategy != AGG_HASHED)
+ need_extra_slot = true;
+ }
+
/*
* The first phase is not sorted, agg need to do its own sort. See
* agg_sort_input(), this can only happen in groupingsets case.
@@ -2384,6 +2558,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.qual =
ExecInitQual(node->plan.qual, (PlanState *) aggstate);
+ /*
+ * Initialize the expression state to fetch the grouping set ID from
+ * the partial grouping sets aggregate result.
+ */
+ aggstate->gsetid =
+ ExecInitExpr(node->gsetid, (PlanState *)aggstate);
/*
* We should now have found all Aggrefs in the targetlist and quals.
*/
@@ -2430,6 +2610,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
all_grouped_cols = bms_add_members(all_grouped_cols, cols);
+ /*
+ * In the initial stage of partial grouping sets, an extra grouping
+ * set ID is provided in the targetlist; fill the setno <-> gsetid
+ * map so EEOP_GROUPING_SET_ID can evaluate the correct gsetid for
+ * the output.
+ */
+ if (aggnode->rollup &&
+ DO_AGGSPLIT_SERIALIZE(aggnode->aggsplit))
+ {
+ GroupingSetData *gs;
+ phasedata->setno_gsetids = palloc(sizeof(int));
+ gs = linitial_node(GroupingSetData,
+ aggnode->rollup->gsets_data);
+ phasedata->setno_gsetids[0] = gs->setId;
+ }
+
/*
* Initialize pergroup state. For AGG_HASHED, all groups do transition
* on the fly, all pergroup states are kept in hashtable, everytime
@@ -2448,8 +2644,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* we can do the transition immediately when a tuple is fetched,
* which means we can do the transition concurrently with the
* first phase.
+ *
+ * Note: this does not work for the final phase of partial grouping sets,
+ * in which each partial input tuple has a specified target aggregate
+ * phase.
*/
- if (phaseidx > 0)
+ if (phaseidx > 0 && !aggstate->groupingsets_preprocess)
{
aggstate->phases[0]->concurrent_hashes =
lappend(aggstate->phases[0]->concurrent_hashes, perhash);
@@ -2467,17 +2667,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->aggnode = aggnode;
phasedata->aggstrategy = aggnode->aggstrategy;
- if (aggnode->groupingSets)
+ if (aggnode->rollup)
{
- phasedata->numsets = list_length(aggnode->groupingSets);
+ phasedata->numsets = list_length(aggnode->rollup->gsets_data);
phasedata->gset_lengths = palloc(phasedata->numsets * sizeof(int));
phasedata->grouped_cols = palloc(phasedata->numsets * sizeof(Bitmapset *));
+ phasedata->setno_gsetids = palloc(phasedata->numsets * sizeof(int));
i = 0;
- foreach(l, aggnode->groupingSets)
+ foreach(l, aggnode->rollup->gsets_data)
{
- int current_length = list_length(lfirst(l));
- Bitmapset *cols = NULL;
+ GroupingSetData *gs = lfirst_node(GroupingSetData, l);
+ int current_length = list_length(gs->set);
+ Bitmapset *cols = NULL;
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -2486,6 +2688,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->grouped_cols[i] = cols;
phasedata->gset_lengths[i] = current_length;
+ /*
+ * In the initial stage of partial grouping sets, an extra grouping
+ * set ID is provided in the targetlist; fill the setno <-> gsetid
+ * map so EEOP_GROUPING_SET_ID can evaluate the correct gsetid for
+ * the output.
+ */
+ if (DO_AGGSPLIT_SERIALIZE(aggstate->aggsplit))
+ phasedata->setno_gsetids[i] = gs->setId;
+
++i;
}
@@ -2562,8 +2773,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* For non-first AGG_SORTED phase, it processes the same input
* tuples with previous phase except that it need to resort the
* input tuples. Tell the previous phase to copy out the tuples.
+ *
+ * Note: this doesn't work for the final stage of partial grouping sets,
+ * in which each tuple has a specified target aggregate phase.
*/
- if (phaseidx > 0)
+ if (phaseidx > 0 && !aggstate->groupingsets_preprocess)
{
AggStatePerPhaseSort prev =
(AggStatePerPhaseSort) aggstate->phases[phaseidx - 1];
@@ -2574,6 +2788,18 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
}
+ /*
+ * Fill the gsetid_phaseidxs array so we can find the corresponding
+ * phase from a gsetid.
+ */
+ if (aggstate->groupingsets_preprocess)
+ {
+ GroupingSetData *gs =
+ linitial_node(GroupingSetData, aggnode->rollup->gsets_data);
+
+ aggstate->gsetid_phaseidxs[gs->setId] = phaseidx;
+ }
+
aggstate->phases[phaseidx] = phasedata;
}
@@ -3461,6 +3687,8 @@ ExecEndAgg(AggState *node)
persort = (AggStatePerPhaseSort) phase;
if (persort->sort_in)
tuplesort_end(persort->sort_in);
+ if (persort->store_in)
+ tuplestore_end(persort->store_in);
}
for (transno = 0; transno < node->numtrans; transno++)
@@ -3643,6 +3871,13 @@ ExecReScanAgg(AggState *node)
}
}
+ /*
+ * If the agg is doing the final stage of partial grouping sets, reset
+ * the flag so grouping sets preprocessing is done again.
+ */
+ if (aggnode->rollup && DO_AGGSPLIT_COMBINE(node->aggsplit))
+ node->groupingsets_preprocess = true;
+
/* reset to phase 0 */
initialize_phase(node, 0);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 066cd59554..f442442269 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -1882,6 +1882,46 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_GROUPING_SET_ID:
+ {
+ LLVMValueRef v_resvalue;
+ LLVMValueRef v_aggstatep;
+ LLVMValueRef v_phase;
+ LLVMValueRef v_current_set;
+ LLVMValueRef v_setno_gsetids;
+
+ v_aggstatep =
+ LLVMBuildBitCast(b, v_parent, l_ptr(StructAggState), "");
+
+ /*
+ * op->resvalue =
+ * aggstate->phase->setno_gsetids
+ * [aggstate->current_set]
+ */
+ v_phase =
+ l_load_struct_gep(b, v_aggstatep,
+ FIELDNO_AGGSTATE_PHASE,
+ "aggstate.phase");
+ v_setno_gsetids =
+ l_load_struct_gep(b, v_phase,
+ FIELDNO_AGGSTATEPERPHASE_SETNOGSETIDS,
+ "aggstateperphase.setno_gsetids");
+ v_current_set =
+ l_load_struct_gep(b, v_aggstatep,
+ FIELDNO_AGGSTATE_CURRENT_SET,
+ "aggstate.current_set");
+ v_resvalue =
+ l_load_gep1(b, v_setno_gsetids, v_current_set, "");
+ v_resvalue =
+ LLVMBuildZExt(b, v_resvalue, TypeSizeT, "");
+
+ LLVMBuildStore(b, v_resvalue, v_resvaluep);
+ LLVMBuildStore(b, l_sbool_const(0), v_resnullp);
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
case EEOP_WINDOW_FUNC:
{
WindowFuncExprState *wfunc = op->d.window_func.wfstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 20ed43604e..691857fb99 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -990,8 +990,9 @@ _copyAgg(const Agg *from)
COPY_SCALAR_FIELD(numGroups);
COPY_SCALAR_FIELD(transitionSpace);
COPY_BITMAPSET_FIELD(aggParams);
- COPY_NODE_FIELD(groupingSets);
+ COPY_NODE_FIELD(rollup);
COPY_NODE_FIELD(chain);
+ COPY_NODE_FIELD(gsetid);
COPY_NODE_FIELD(sortnode);
return newnode;
@@ -1478,6 +1479,50 @@ _copyGroupingFunc(const GroupingFunc *from)
return newnode;
}
+/*
+ * _copyGroupingSetId
+ */
+static GroupingSetId *
+_copyGroupingSetId(const GroupingSetId *from)
+{
+ GroupingSetId *newnode = makeNode(GroupingSetId);
+
+ return newnode;
+}
+
+/*
+ * _copyRollupData
+ */
+static RollupData *
+_copyRollupData(const RollupData *from)
+{
+ RollupData *newnode = makeNode(RollupData);
+
+ COPY_NODE_FIELD(groupClause);
+ COPY_NODE_FIELD(gsets);
+ COPY_NODE_FIELD(gsets_data);
+ COPY_SCALAR_FIELD(numGroups);
+ COPY_SCALAR_FIELD(hashable);
+ COPY_SCALAR_FIELD(is_hashed);
+
+ return newnode;
+}
+
+/*
+ * _copyGroupingSetData
+ */
+static GroupingSetData *
+_copyGroupingSetData(const GroupingSetData *from)
+{
+ GroupingSetData *newnode = makeNode(GroupingSetData);
+
+ COPY_NODE_FIELD(set);
+ COPY_SCALAR_FIELD(setId);
+ COPY_SCALAR_FIELD(numGroups);
+
+ return newnode;
+}
+
/*
* _copyWindowFunc
*/
@@ -4961,6 +5006,9 @@ copyObjectImpl(const void *from)
case T_GroupingFunc:
retval = _copyGroupingFunc(from);
break;
+ case T_GroupingSetId:
+ retval = _copyGroupingSetId(from);
+ break;
case T_WindowFunc:
retval = _copyWindowFunc(from);
break;
@@ -5594,6 +5642,12 @@ copyObjectImpl(const void *from)
case T_SortGroupClause:
retval = _copySortGroupClause(from);
break;
+ case T_RollupData:
+ retval = _copyRollupData(from);
+ break;
+ case T_GroupingSetData:
+ retval = _copyGroupingSetData(from);
+ break;
case T_GroupingSet:
retval = _copyGroupingSet(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 5b1ba143b1..7589bce3c8 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3069,6 +3069,9 @@ equal(const void *a, const void *b)
case T_GroupingFunc:
retval = _equalGroupingFunc(a, b);
break;
+ case T_GroupingSetId:
+ retval = true;
+ break;
case T_WindowFunc:
retval = _equalWindowFunc(a, b);
break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index d85ca9f7c5..877ea0bc16 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -62,6 +62,9 @@ exprType(const Node *expr)
case T_GroupingFunc:
type = INT4OID;
break;
+ case T_GroupingSetId:
+ type = INT4OID;
+ break;
case T_WindowFunc:
type = ((const WindowFunc *) expr)->wintype;
break;
@@ -740,6 +743,9 @@ exprCollation(const Node *expr)
case T_GroupingFunc:
coll = InvalidOid;
break;
+ case T_GroupingSetId:
+ coll = InvalidOid;
+ break;
case T_WindowFunc:
coll = ((const WindowFunc *) expr)->wincollid;
break;
@@ -1869,6 +1875,7 @@ expression_tree_walker(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
/* primitive node types with no expression subnodes */
break;
case T_WithCheckOption:
@@ -2575,6 +2582,7 @@ expression_tree_mutator(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
return (Node *) copyObject(node);
case T_WithCheckOption:
{
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5816d122c1..efcb1c7d4f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -785,8 +785,9 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_LONG_FIELD(numGroups);
WRITE_UINT64_FIELD(transitionSpace);
WRITE_BITMAPSET_FIELD(aggParams);
- WRITE_NODE_FIELD(groupingSets);
+ WRITE_NODE_FIELD(rollup);
WRITE_NODE_FIELD(chain);
+ WRITE_NODE_FIELD(gsetid);
WRITE_NODE_FIELD(sortnode);
}
@@ -1150,6 +1151,13 @@ _outGroupingFunc(StringInfo str, const GroupingFunc *node)
WRITE_LOCATION_FIELD(location);
}
+static void
+_outGroupingSetId(StringInfo str,
+ const GroupingSetId *node __attribute__((unused)))
+{
+ WRITE_NODE_TYPE("GROUPINGSETID");
+}
+
static void
_outWindowFunc(StringInfo str, const WindowFunc *node)
{
@@ -2002,6 +2010,7 @@ _outGroupingSetData(StringInfo str, const GroupingSetData *node)
WRITE_NODE_TYPE("GSDATA");
WRITE_NODE_FIELD(set);
+ WRITE_INT_FIELD(setId);
WRITE_FLOAT_FIELD(numGroups, "%.0f");
}
@@ -3847,6 +3856,9 @@ outNode(StringInfo str, const void *obj)
case T_GroupingFunc:
_outGroupingFunc(str, obj);
break;
+ case T_GroupingSetId:
+ _outGroupingSetId(str, obj);
+ break;
case T_WindowFunc:
_outWindowFunc(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index af4fcfe1ee..c9a3340f58 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -636,6 +636,50 @@ _readGroupingFunc(void)
READ_DONE();
}
+/*
+ * _readGroupingSetId
+ */
+static GroupingSetId *
+_readGroupingSetId(void)
+{
+ READ_LOCALS_NO_FIELDS(GroupingSetId);
+
+ READ_DONE();
+}
+
+/*
+ * _readRollupData
+ */
+static RollupData *
+_readRollupData(void)
+{
+ READ_LOCALS(RollupData);
+
+ READ_NODE_FIELD(groupClause);
+ READ_NODE_FIELD(gsets);
+ READ_NODE_FIELD(gsets_data);
+ READ_FLOAT_FIELD(numGroups);
+ READ_BOOL_FIELD(hashable);
+ READ_BOOL_FIELD(is_hashed);
+
+ READ_DONE();
+}
+
+/*
+ * _readGroupingSetData
+ */
+static GroupingSetData *
+_readGroupingSetData(void)
+{
+ READ_LOCALS(GroupingSetData);
+
+ READ_NODE_FIELD(set);
+ READ_INT_FIELD(setId);
+ READ_FLOAT_FIELD(numGroups);
+
+ READ_DONE();
+}
+
/*
* _readWindowFunc
*/
@@ -2205,8 +2249,9 @@ _readAgg(void)
READ_LONG_FIELD(numGroups);
READ_UINT64_FIELD(transitionSpace);
READ_BITMAPSET_FIELD(aggParams);
- READ_NODE_FIELD(groupingSets);
+ READ_NODE_FIELD(rollup);
READ_NODE_FIELD(chain);
+ READ_NODE_FIELD(gsetid);
READ_NODE_FIELD(sortnode);
READ_DONE();
@@ -2642,6 +2687,12 @@ parseNodeString(void)
return_value = _readAggref();
else if (MATCH("GROUPINGFUNC", 12))
return_value = _readGroupingFunc();
+ else if (MATCH("GROUPINGSETID", 13))
+ return_value = _readGroupingSetId();
+ else if (MATCH("ROLLUP", 6))
+ return_value = _readRollupData();
+ else if (MATCH("GSDATA", 6))
+ return_value = _readGroupingSetData();
else if (MATCH("WINDOWFUNC", 10))
return_value = _readWindowFunc();
else if (MATCH("SUBSCRIPTINGREF", 15))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..e6c7f080e0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2710,8 +2710,11 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
/*
* For each useful ordering, we can consider an order-preserving Gather
- * Merge.
+ * Merge. Don't do this for partial grouping sets.
*/
+ if (root->parse->groupingSets)
+ return;
+
foreach(lc, rel->partial_pathlist)
{
Path *subpath = (Path *) lfirst(lc);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index c33f0b134b..adb8123d6f 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1641,7 +1641,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
groupColIdx,
groupOperators,
groupCollations,
- NIL,
+ NULL,
NIL,
best_path->path.rows,
0,
@@ -2095,7 +2095,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
extract_grouping_ops(best_path->groupClause),
extract_grouping_collations(best_path->groupClause,
subplan->targetlist),
- NIL,
+ NULL,
NIL,
best_path->numGroups,
best_path->transitionSpace,
@@ -2215,7 +2215,6 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
* never be grouping in an UPDATE/DELETE; but let's Assert that.
*/
Assert(root->inhTargetKind == INHKIND_NONE);
- Assert(root->grouping_map == NULL);
root->grouping_map = grouping_map;
/*
@@ -2237,7 +2236,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
- if (!rollup->is_hashed)
+ /* In the final stage, a rollup may contain an empty set here */
+ if (!rollup->is_hashed &&
+ list_length(linitial(rollup->gsets)) != 0)
{
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
@@ -2261,12 +2262,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
NIL,
rollup->numGroups,
best_path->transitionSpace,
@@ -2278,8 +2279,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
}
/*
* Now make the real Agg node
*/
{
RollupData *rollup = linitial(rollups);
AttrNumber *top_grpColIdx;
@@ -2308,12 +2308,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
chain,
rollup->numGroups,
best_path->transitionSpace,
@@ -6215,7 +6215,7 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ RollupData *rollup, List *chain, double dNumGroups,
Size transitionSpace, Plan *sortnode, Plan *lefttree)
{
Agg *node = makeNode(Agg);
@@ -6234,8 +6234,9 @@ make_agg(List *tlist, List *qual,
node->numGroups = numGroups;
node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
- node->groupingSets = groupingSets;
+ node->rollup = rollup;
node->chain = chain;
+ node->gsetid = NULL;
node->sortnode = sortnode;
plan->qual = qual;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 28ae0644bd..e9b4492a02 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -113,6 +113,7 @@ typedef struct
Bitmapset *unhashable_refs;
List *unsortable_sets;
int *tleref_to_colnum_map;
+ int num_sets;
} grouping_sets_data;
/*
@@ -126,6 +127,8 @@ typedef struct
* clauses per Window */
} WindowClauseSortData;
+typedef void (*AddPathCallback) (RelOptInfo *parent_rel, Path *new_path);
+
/* Local functions */
static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
@@ -142,7 +145,8 @@ static double preprocess_limit(PlannerInfo *root,
static void remove_useless_groupby_columns(PlannerInfo *root);
static List *preprocess_groupclause(PlannerInfo *root, List *force);
static List *extract_rollup_sets(List *groupingSets);
-static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
+static List *reorder_grouping_sets(grouping_sets_data *gd,
+ List *groupingSets, List *sortclause);
static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
@@ -176,7 +180,11 @@ static void consider_groupingsets_paths(PlannerInfo *root,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
double dNumGroups,
- AggStrategy strat);
+ List *havingQual,
+ AggStrategy strat,
+ AggSplit aggsplit,
+ AddPathCallback add_path_fn);
+
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -250,6 +258,9 @@ static bool group_by_has_partkey(RelOptInfo *input_rel,
List *groupClause);
static int common_prefix_cmp(const void *a, const void *b);
+static List *extract_final_rollups(PlannerInfo *root,
+ grouping_sets_data *gd,
+ List *rollups);
/*****************************************************************************
*
@@ -2494,6 +2505,7 @@ preprocess_grouping_sets(PlannerInfo *root)
GroupingSetData *gs = makeNode(GroupingSetData);
gs->set = gset;
+ gs->setId = gd->num_sets++;
gd->unsortable_sets = lappend(gd->unsortable_sets, gs);
/*
@@ -2538,7 +2550,7 @@ preprocess_grouping_sets(PlannerInfo *root)
* largest-member-first, and applies the GroupingSetData annotations,
* though the data will be filled in later.
*/
- current_sets = reorder_grouping_sets(current_sets,
+ current_sets = reorder_grouping_sets(gd, current_sets,
(list_length(sets) == 1
? parse->sortClause
: NIL));
@@ -3547,7 +3559,7 @@ extract_rollup_sets(List *groupingSets)
* gets implemented in one pass.)
*/
static List *
-reorder_grouping_sets(List *groupingsets, List *sortclause)
+reorder_grouping_sets(grouping_sets_data *gd, List *groupingsets, List *sortclause)
{
ListCell *lc;
List *previous = NIL;
@@ -3581,6 +3593,7 @@ reorder_grouping_sets(List *groupingsets, List *sortclause)
previous = list_concat(previous, new_elems);
gs->set = list_copy(previous);
+ gs->setId = gd->num_sets++;
result = lcons(gs, result);
}
@@ -4190,13 +4203,18 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
*
* - strat:
* preferred aggregate strategy to use.
- *
+ *
* - is_sorted:
* Is the input sorted on the groupCols of the first rollup. Caller
* must set it correctly if strat is set to AGG_SORTED, the planner
* uses it to generate a sortnode.
+ *
+ * - add_path_fn:
+ * the callback used to add a path; PARTITIONWISE_AGGREGATE_PARTIAL
+ * may want to add a partial path to grouped_rel->pathlist, so we
+ * cannot choose the add_path function based on the aggsplit.
*/
-static void
+static void
consider_groupingsets_paths(PlannerInfo *root,
RelOptInfo *grouped_rel,
Path *path,
@@ -4205,9 +4223,11 @@ consider_groupingsets_paths(PlannerInfo *root,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
double dNumGroups,
- AggStrategy strat)
+ List *havingQual,
+ AggStrategy strat,
+ AggSplit aggsplit,
+ AddPathCallback add_path_fn)
{
- Query *parse = root->parse;
Assert(strat == AGG_HASHED || strat == AGG_SORTED);
/*
@@ -4368,16 +4388,20 @@ consider_groupingsets_paths(PlannerInfo *root,
strat = AGG_MIXED;
}
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- strat,
- new_rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ new_rollups = extract_final_rollups(root, gd, new_rollups);
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ strat,
+ new_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
return;
}
@@ -4389,7 +4413,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* Callers consider AGG_SORTED strategy, the first rollup must
- * use non-hashed aggregate, 'is_sorted' tells whether the first
+ * use non-hashed aggregate, is_sorted tells whether the first
* rollup need to do its own sort.
*
* we try and make two paths: one sorted and one mixed
@@ -4531,16 +4555,20 @@ consider_groupingsets_paths(PlannerInfo *root,
if (rollups)
{
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_MIXED,
- rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rollups = extract_final_rollups(root, gd, rollups);
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_MIXED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
}
}
@@ -4548,16 +4576,82 @@ consider_groupingsets_paths(PlannerInfo *root,
* Now try the simple sorted case.
*/
if (!gd->unsortable_sets)
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_SORTED,
- gd->rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ {
+ List *rollups;
+
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rollups = extract_final_rollups(root, gd, gd->rollups);
+ else
+ rollups = gd->rollups;
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_SORTED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
+ }
+}
+
+/*
+ * When combining partial grouping sets aggregation, the input mixes
+ * tuples from different grouping sets, and the executor dispatches the
+ * tuples to different rollups (phases) according to the grouping set ID.
+ *
+ * We cannot reuse the rollups of the initial stage, in which each tuple
+ * is processed by one or more grouping sets within one rollup, because
+ * in the combining stage each tuple belongs to exactly one grouping set.
+ * Instead we use final rollups, in which each rollup contains only one
+ * grouping set.
+ */
+static List *
+extract_final_rollups(PlannerInfo *root,
+ grouping_sets_data *gd,
+ List *rollups)
+{
+ ListCell *lc;
+ List *new_rollups = NIL;
+
+ foreach(lc, rollups)
+ {
+ ListCell *lc1;
+ RollupData *rollup = lfirst_node(RollupData, lc);
+
+ foreach(lc1, rollup->gsets_data)
+ {
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc1);
+ RollupData *new_rollup = makeNode(RollupData);
+
+ if (gs->set != NIL)
+ {
+ new_rollup->groupClause = preprocess_groupclause(root, gs->set);
+ new_rollup->gsets_data = list_make1(gs);
+ new_rollup->gsets = remap_to_groupclause_idx(new_rollup->groupClause,
+ new_rollup->gsets_data,
+ gd->tleref_to_colnum_map);
+ new_rollup->hashable = rollup->hashable;
+ new_rollup->is_hashed = rollup->is_hashed;
+ }
+ else
+ {
+ new_rollup->groupClause = NIL;
+ new_rollup->gsets_data = list_make1(gs);
+ new_rollup->gsets = list_make1(NIL);
+ new_rollup->hashable = false;
+ new_rollup->is_hashed = false;
+ }
+
+ new_rollup->numGroups = gs->numGroups;
+ new_rollups = lappend(new_rollups, new_rollup);
+ }
+ }
+
+ return new_rollups;
}
/*
@@ -5267,6 +5361,17 @@ make_partial_grouping_target(PlannerInfo *root,
add_new_columns_to_pathtarget(partial_target, non_group_exprs);
+ /*
+ * When generating a partial grouping sets path, add an expression that
+ * exposes the grouping set ID of each tuple, so in the final stage the
+ * executor knows which set a tuple belongs to and can combine them.
+ */
+ if (parse->groupingSets)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ add_new_column_to_pathtarget(partial_target, (Expr *)expr);
+ }
+
/*
* Adjust Aggrefs to put them in partial mode. At this point all Aggrefs
* are at the top level of the target list, so we can just scan the list
@@ -6430,7 +6535,10 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
consider_groupingsets_paths(root, grouped_rel,
path, is_sorted, can_hash,
gd, agg_costs, dNumGroups,
- AGG_SORTED);
+ havingQual,
+ AGG_SORTED,
+ AGGSPLIT_SIMPLE,
+ add_path);
continue;
}
@@ -6491,15 +6599,37 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
foreach(lc, partially_grouped_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
+ bool is_sorted;
+
+ is_sorted = pathkeys_contained_in(root->group_pathkeys,
+ path->pathkeys);
+
+ /*
+ * Use any available suitably-sorted path as input, and also
+ * consider sorting the cheapest-total path.
+ */
+ if (path != partially_grouped_rel->cheapest_total_path &&
+ !is_sorted)
+ continue;
+
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGG_SORTED,
+ AGGSPLIT_FINAL_DESERIAL,
+ add_path);
+ continue;
+ }
/*
* Insert a Sort node, if required. But there's no point in
* sorting anything but the cheapest path.
*/
- if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+ if (!is_sorted)
{
- if (path != partially_grouped_rel->cheapest_total_path)
- continue;
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -6543,7 +6673,10 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
gd, agg_costs, dNumGroups,
- AGG_HASHED);
+ havingQual,
+ AGG_HASHED,
+ AGGSPLIT_SIMPLE,
+ add_path);
}
else
{
@@ -6586,22 +6719,39 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = partially_grouped_rel->cheapest_total_path;
- hashaggtablesize = estimate_hashagg_tablesize(path,
- agg_final_costs,
- dNumGroups);
+ if (parse->groupingSets)
+ {
+ /*
+ * Try for a hash-only groupingsets path over unsorted input.
+ */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ add_path);
+ }
+ else
+ {
- if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ hashaggtablesize = estimate_hashagg_tablesize(path,
+ agg_final_costs,
+ dNumGroups);
+
+ if (hashaggtablesize < work_mem * 1024L)
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6811,6 +6961,19 @@ create_partial_grouping_paths(PlannerInfo *root,
path->pathkeys);
if (path == cheapest_partial_path || is_sorted)
{
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGG_SORTED,
+ AGGSPLIT_INITIAL_SERIAL,
+ add_partial_path);
+ continue;
+ }
+
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
path = (Path *) create_sort_path(root,
@@ -6818,7 +6981,7 @@ create_partial_grouping_paths(PlannerInfo *root,
path,
root->group_pathkeys,
-1.0);
-
+
if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
create_agg_path(root,
@@ -6880,26 +7043,41 @@ create_partial_grouping_paths(PlannerInfo *root,
{
double hashaggtablesize;
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_partial_path,
- agg_partial_costs,
- dNumPartialPartialGroups);
-
- /* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
- cheapest_partial_path != NULL)
+ if (parse->groupingSets)
{
- add_partial_path(partially_grouped_rel, (Path *)
- create_agg_path(root,
- partially_grouped_rel,
- cheapest_partial_path,
- partially_grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_INITIAL_SERIAL,
- parse->groupClause,
- NIL,
- agg_partial_costs,
- dNumPartialPartialGroups));
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ cheapest_partial_path,
+ false, true,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ add_partial_path);
+ }
+ else
+ {
+ hashaggtablesize =
+ estimate_hashagg_tablesize(cheapest_partial_path,
+ agg_partial_costs,
+ dNumPartialPartialGroups);
+
+ /* Do the same for partial paths. */
+ if (hashaggtablesize < work_mem * 1024L &&
+ cheapest_partial_path != NULL)
+ {
+ add_partial_path(partially_grouped_rel, (Path *)
+ create_agg_path(root,
+ partially_grouped_rel,
+ cheapest_partial_path,
+ partially_grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ parse->groupClause,
+ NIL,
+ agg_partial_costs,
+ dNumPartialPartialGroups));
+ }
}
}
@@ -6943,6 +7121,9 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
generate_gather_paths(root, rel, true);
/* Try cheapest partial path + explicit Sort + Gather Merge. */
+ if (root->parse->groupingSets)
+ return;
+
cheapest_partial_path = linitial(rel->partial_pathlist);
if (!pathkeys_contained_in(root->group_pathkeys,
cheapest_partial_path->pathkeys))
@@ -6987,11 +7168,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..eae7d15701 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -754,6 +754,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
plan->qual = (List *)
convert_combining_aggrefs((Node *) plan->qual,
NULL);
+
+ /*
+ * If this is a grouping sets aggregate, we must add an expression
+ * to evaluate the grouping set ID and fix its reference against
+ * the targetlist of the child plan node.
+ */
+ if (agg->rollup)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ indexed_tlist *subplan_itlist = build_tlist_index(plan->lefttree->targetlist);
+
+ agg->gsetid = (Expr *) fix_upper_expr(root, (Node *)expr,
+ subplan_itlist,
+ OUTER_VAR,
+ rtoffset);
+ }
}
set_upper_references(root, plan, rtoffset);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 2dfa3fa17e..5a92e4892e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2994,6 +2994,7 @@ create_groupingsets_path(PlannerInfo *root,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups,
+ AggSplit aggsplit,
bool is_sorted)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
@@ -3011,6 +3012,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->aggsplit = aggsplit;
pathnode->is_sorted = is_sorted;
/*
@@ -3045,11 +3047,27 @@ create_groupingsets_path(PlannerInfo *root,
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
Assert(aggstrategy != AGG_MIXED || list_length(rollups) > 1);
+ /*
+ * Estimate the cost of the grouping sets.
+ *
+ * If we are finalizing grouping sets, subpath->rows contains
+ * rows from all sets, so we must estimate the number of rows
+ * belonging to each rollup. The cost of preprocessing the
+ * grouping sets is not charged here: the expression used to
+ * redirect tuples is a simple Var, which normally costs
+ * nothing.
+ */
foreach(lc, rollups)
{
RollupData *rollup = lfirst(lc);
List *gsets = rollup->gsets;
int numGroupCols = list_length(linitial(gsets));
+ double rows;
+
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rows = rollup->numGroups * subpath->rows / numGroups;
+ else
+ rows = subpath->rows;
/*
* In AGG_SORTED or AGG_PLAIN mode, the first rollup do its own
@@ -3071,7 +3089,7 @@ create_groupingsets_path(PlannerInfo *root,
cost_sort(&sort_path, root, NIL,
input_total_cost,
- subpath->rows,
+ rows,
subpath->pathtarget->width,
0.0,
work_mem,
@@ -3089,7 +3107,7 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
input_startup_cost,
input_total_cost,
- subpath->rows);
+ rows);
is_first = false;
}
@@ -3101,7 +3119,6 @@ create_groupingsets_path(PlannerInfo *root,
sort_path.startup_cost = 0;
sort_path.total_cost = 0;
- sort_path.rows = subpath->rows;
rollup_strategy = rollup->is_hashed ?
AGG_HASHED : (numGroupCols ? AGG_SORTED : AGG_PLAIN);
@@ -3111,7 +3128,7 @@ create_groupingsets_path(PlannerInfo *root,
/* Account for cost of sort, but don't charge input cost again */
cost_sort(&sort_path, root, NIL,
0.0,
- subpath->rows,
+ rows,
subpath->pathtarget->width,
0.0,
work_mem,
@@ -3127,7 +3144,7 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ rows);
pathnode->path.total_cost += agg_path.total_cost;
pathnode->path.rows += agg_path.rows;
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 5e63238f03..5779d158ba 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -7941,6 +7941,12 @@ get_rule_expr(Node *node, deparse_context *context,
}
break;
+ case T_GroupingSetId:
+ {
+ appendStringInfoString(buf, "GROUPINGSETID()");
+ }
+ break;
+
case T_WindowFunc:
get_windowfunc_expr((WindowFunc *) node, context);
break;
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 4ed5d0a7de..4d36c2d77b 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -216,6 +216,7 @@ typedef enum ExprEvalOp
EEOP_XMLEXPR,
EEOP_AGGREF,
EEOP_GROUPING_FUNC,
+ EEOP_GROUPING_SET_ID,
EEOP_WINDOW_FUNC,
EEOP_SUBPLAN,
EEOP_ALTERNATIVE_SUBPLAN,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index c5d4121c37..967af08af7 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -281,6 +281,8 @@ typedef struct AggStatePerPhaseData
List *concurrent_hashes; /* hash phases can do transition concurrently */
AggStatePerGroup *pergroups; /* pergroup states for a phase */
bool skip_build_trans;
+#define FIELDNO_AGGSTATEPERPHASE_SETNOGSETIDS 10
+ int *setno_gsetids; /* setno <-> gsetid map */
} AggStatePerPhaseData;
typedef struct AggStatePerPhaseSortData
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4081a0978e..dea5b10597 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2047,6 +2047,7 @@ typedef struct AggState
int numtrans; /* number of pertrans items */
AggStrategy aggstrategy; /* strategy mode */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
+#define FIELDNO_AGGSTATE_PHASE 6
AggStatePerPhase phase; /* pointer to current phase data */
int numphases; /* number of phases (including phase 0) */
int current_phase; /* current phase number */
@@ -2070,8 +2071,6 @@ typedef struct AggState
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
AggStatePerPhase *phases; /* array of all phases */
- Tuplesortstate *sort_in; /* sorted input to phases > 1 */
- Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
@@ -2083,6 +2082,11 @@ typedef struct AggState
int eflags; /* eflags for the first sort */
ProjectionInfo *combinedproj; /* projection machinery */
+
+ /* these fields are used by parallel grouping sets */
+ bool groupingsets_preprocess; /* groupingsets preprocessed yet? */
+ ExprState *gsetid; /* expression state to get grpsetid from input */
+ int *gsetid_phaseidxs; /* grpsetid <-> phaseidx mapping */
} AggState;
/* ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index baced7eec0..31f7cd1ff7 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -153,6 +153,7 @@ typedef enum NodeTag
T_Param,
T_Aggref,
T_GroupingFunc,
+ T_GroupingSetId,
T_WindowFunc,
T_SubscriptingRef,
T_FuncExpr,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c1e69c808f..2761fa6d01 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1676,6 +1676,7 @@ typedef struct GroupingSetData
{
NodeTag type;
List *set; /* grouping set as list of sortgrouprefs */
+ int setId; /* unique grouping set identifier */
double numGroups; /* est. number of result groups */
} GroupingSetData;
@@ -1702,6 +1703,7 @@ typedef struct GroupingSetsPath
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
uint64 transitionSpace; /* for pass-by-ref transition data */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
bool is_sorted; /* input sorted in groupcols of first rollup */
} GroupingSetsPath;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 3cd2537e9e..5b1239adf2 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -20,6 +20,7 @@
#include "nodes/bitmapset.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
+#include "nodes/pathnodes.h"
/* ----------------------------------------------------------------
@@ -816,8 +817,9 @@ typedef struct Agg
uint64 transitionSpace; /* for pass-by-ref transition data */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
- List *groupingSets; /* grouping sets to use */
+ RollupData *rollup; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Expr *gsetid; /* expression to fetch grouping set id */
Plan *sortnode; /* agg does its own sort, only used by grouping sets now */
} Agg;
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index d73be2ad46..f8f85d431a 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -364,6 +364,12 @@ typedef struct GroupingFunc
int location; /* token location */
} GroupingFunc;
+/* GroupingSetId */
+typedef struct GroupingSetId
+{
+ Expr xpr;
+} GroupingSetId;
+
/*
* WindowFunc
*/
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index f9f388ba06..4fde8b22bf 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -218,6 +218,7 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups,
+ AggSplit aggsplit,
bool is_sorted);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5954ff3997..e987011328 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,7 +54,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ RollupData *rollup, List *chain, double dNumGroups,
Size transitionSpace, Plan *sortnode, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
--
2.14.1
Hi,
unfortunately this got a bit broken by the disk-based hash aggregation,
committed today, and so it needs a rebase. I've started looking at the
patch before that, and I have it rebased on e00912e11a9e (i.e. the
commit before the one that breaks it).
Attached is the rebased patch series (now broken), with a couple of
commits containing minor cosmetic changes I propose to make (easier than
explaining on the list; it's mostly whitespace, comments, etc.).
Feel free to reject the changes, it's up to you.
I'll continue doing the review, but it'd be good to have a fully rebased
version.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
0001-All-grouping-sets-do-their-own-sorting.patch (text/plain; charset=iso-8859-1)
From 7f932c2a2897a92f261b5ccfaea2c2b90823996c Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:07:29 -0400
Subject: [PATCH 1/7] All grouping sets do their own sorting
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
PG used to add a SORT path explicitly beneath the AGG for a sorted aggregate.
A grouping sets path likewise added a SORT path for the first sort aggregate
phase, while the following sort aggregate phases did their own sorting using a
tuplesort. This commit unifies the way a grouping sets path sorts: all sort
aggregate phases now do their own sorting using a tuplesort.
This commit is mainly a preparatory step to support parallel grouping sets. The
main idea of parallel grouping sets is: like parallel aggregate, we separate
grouping sets into two stages:

The initial stage: this stage has almost the same plan and execution routines
as the current implementation of grouping sets. The differences are 1) it only
produces partial aggregate results, and 2) each output row is tagged with an
extra grouping set id. Partial aggregate results will be combined in the final
stage, and since we have multiple grouping sets, only partial aggregate results
belonging to the same grouping set can be combined; that is why the grouping
set id is introduced to identify the sets. We keep all the optimizations of
multiple grouping sets in the initial stage, e.g. 1) the grouping sets that can
be grouped by one single sort are put into one rollup structure, so those sets
are computed in one aggregate phase; 2) hash aggregation is done concurrently
while a sort aggregate is performed; 3) all hash transitions are done in one
expression state.

The final stage: this stage combines the partial aggregate results according to
the grouping set id. Obviously, none of the optimizations from the initial
stage can be used, so all rollups are extracted and each rollup contains only
one grouping set; each aggregate phase then processes only one set. A filter in
the final stage redirects the tuples to the appropriate aggregate phase.

Obviously, adding a SORT path underneath the AGG in the final stage is not
right. This commit avoids that, and all non-hashed aggregate phases can do
their own sorting after the tuples are redirected.
---
src/backend/commands/explain.c | 5 +-
src/backend/executor/nodeAgg.c | 79 +++++++++++--
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 65 ++++++++---
src/backend/optimizer/plan/planner.c | 66 +++++++----
src/backend/optimizer/util/pathnode.c | 30 ++++-
src/include/executor/nodeAgg.h | 2 -
src/include/nodes/execnodes.h | 5 +-
src/include/nodes/pathnodes.h | 1 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/pathnode.h | 3 +-
src/include/optimizer/planmain.h | 2 +-
src/test/regress/expected/groupingsets.out | 130 ++++++++++-----------
15 files changed, 260 insertions(+), 132 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d901dc4a50..b1609b339a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2289,15 +2289,14 @@ show_grouping_sets(PlanState *planstate, Agg *agg,
ExplainOpenGroup("Grouping Sets", "Grouping Sets", false, es);
- show_grouping_set_keys(planstate, agg, NULL,
+ show_grouping_set_keys(planstate, agg, (Sort *) agg->sortnode,
context, useprefix, ancestors, es);
foreach(lc, agg->chain)
{
Agg *aggnode = lfirst(lc);
- Sort *sortnode = (Sort *) aggnode->plan.lefttree;
- show_grouping_set_keys(planstate, aggnode, sortnode,
+ show_grouping_set_keys(planstate, aggnode, (Sort *) aggnode->sortnode,
context, useprefix, ancestors, es);
}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 7aebb247d8..b4f53bf77a 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -278,6 +278,7 @@ static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
+static void agg_sort_input(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
@@ -367,7 +368,7 @@ initialize_phase(AggState *aggstate, int newphase)
*/
if (newphase > 0 && newphase < aggstate->numphases - 1)
{
- Sort *sortnode = aggstate->phases[newphase + 1].sortnode;
+ Sort *sortnode = (Sort *)aggstate->phases[newphase + 1].aggnode->sortnode;
PlanState *outerNode = outerPlanState(aggstate);
TupleDesc tupDesc = ExecGetResultType(outerNode);
@@ -1594,6 +1595,8 @@ ExecAgg(PlanState *pstate)
break;
case AGG_PLAIN:
case AGG_SORTED:
+ if (!node->input_sorted)
+ agg_sort_input(node);
result = agg_retrieve_direct(node);
break;
}
@@ -1945,6 +1948,45 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+static void
+agg_sort_input(AggState *aggstate)
+{
+ AggStatePerPhase phase = &aggstate->phases[1];
+ TupleDesc tupDesc;
+ Sort *sortnode;
+
+ Assert(!aggstate->input_sorted);
+ Assert(phase->aggnode->sortnode);
+
+ sortnode = (Sort *) phase->aggnode->sortnode;
+ tupDesc = ExecGetResultType(outerPlanState(aggstate));
+
+ aggstate->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ for (;;)
+ {
+ TupleTableSlot *outerslot;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
+ if (TupIsNull(outerslot))
+ break;
+
+ tuplesort_puttupleslot(aggstate->sort_in, outerslot);
+ }
+
+ /* Sort the first phase */
+ tuplesort_performsort(aggstate->sort_in);
+
+ /* Mark the input to be sorted */
+ aggstate->input_sorted = true;
+}
+
/*
* ExecAgg for hashed case: read input and build hash table
*/
@@ -2127,6 +2169,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
Plan *outerPlan;
ExprContext *econtext;
TupleDesc scanDesc;
+ Agg *firstSortAgg;
int numaggs,
transno,
aggno;
@@ -2171,6 +2214,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->grp_firstTuple = NULL;
aggstate->sort_in = NULL;
aggstate->sort_out = NULL;
+ aggstate->input_sorted = true;
/*
* phases[0] always exists, but is dummy in sorted/plain mode
@@ -2178,6 +2222,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numPhases = (use_hashing ? 1 : 2);
numHashes = (use_hashing ? 1 : 0);
+ firstSortAgg = node->aggstrategy == AGG_SORTED ? node : NULL;
+
/*
* Calculate the maximum number of grouping sets in any phase; this
* determines the size of some allocations. Also calculate the number of
@@ -2199,7 +2245,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* others add an extra phase.
*/
if (agg->aggstrategy != AGG_HASHED)
+ {
++numPhases;
+
+ if (!firstSortAgg)
+ firstSortAgg = agg;
+
+ }
else
++numHashes;
}
@@ -2208,6 +2260,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->maxsets = numGroupingSets;
aggstate->numphases = numPhases;
+ /*
+ * The first SORTED phase's input is not sorted, so the agg needs to
+ * do its own sort; see agg_sort_input(). This can only happen in the
+ * grouping sets case.
+ */
+ if (firstSortAgg && firstSortAgg->sortnode)
+ aggstate->input_sorted = false;
+
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
@@ -2269,7 +2328,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* If there are more than two phases (including a potential dummy phase
* 0), input will be resorted using tuplesort. Need a slot for that.
*/
- if (numPhases > 2)
+ if (numPhases > 2 ||
+ !aggstate->input_sorted)
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -2340,20 +2400,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
for (phaseidx = 0; phaseidx <= list_length(node->chain); ++phaseidx)
{
Agg *aggnode;
- Sort *sortnode;
if (phaseidx > 0)
- {
aggnode = list_nth_node(Agg, node->chain, phaseidx - 1);
- sortnode = castNode(Sort, aggnode->plan.lefttree);
- }
else
- {
aggnode = node;
- sortnode = NULL;
- }
-
- Assert(phase <= 1 || sortnode);
if (aggnode->aggstrategy == AGG_HASHED
|| aggnode->aggstrategy == AGG_MIXED)
@@ -2470,7 +2521,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->aggnode = aggnode;
phasedata->aggstrategy = aggnode->aggstrategy;
- phasedata->sortnode = sortnode;
}
}
@@ -3559,6 +3609,10 @@ ExecReScanAgg(AggState *node)
sizeof(AggStatePerGroupData) * node->numaggs);
}
+ /* Reset input_sorted */
+ if (aggnode->sortnode)
+ node->input_sorted = false;
+
/* reset to phase 1 */
initialize_phase(node, 1);
@@ -3566,6 +3620,7 @@ ExecReScanAgg(AggState *node)
node->projected_set = -1;
}
+
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..04b4c65858 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -992,6 +992,7 @@ _copyAgg(const Agg *from)
COPY_BITMAPSET_FIELD(aggParams);
COPY_NODE_FIELD(groupingSets);
COPY_NODE_FIELD(chain);
+ COPY_NODE_FIELD(sortnode);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..5816d122c1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -787,6 +787,7 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_BITMAPSET_FIELD(aggParams);
WRITE_NODE_FIELD(groupingSets);
WRITE_NODE_FIELD(chain);
+ WRITE_NODE_FIELD(sortnode);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..af4fcfe1ee 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2207,6 +2207,7 @@ _readAgg(void)
READ_BITMAPSET_FIELD(aggParams);
READ_NODE_FIELD(groupingSets);
READ_NODE_FIELD(chain);
+ READ_NODE_FIELD(sortnode);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..d5b34089aa 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1645,6 +1645,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
best_path->path.rows,
0,
+ NULL,
subplan);
}
else
@@ -2098,6 +2099,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
best_path->numGroups,
best_path->transitionSpace,
+ NULL,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2159,6 +2161,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
List *rollups = best_path->rollups;
AttrNumber *grouping_map;
int maxref;
+ int flags = CP_LABEL_TLIST;
List *chain;
ListCell *lc;
@@ -2168,9 +2171,15 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
/*
* Agg can project, so no need to be terribly picky about child tlist, but
- * we do need grouping columns to be available
+ * we do need grouping columns to be available. If the grouping sets
+ * need to sort the input, the agg will store the input rows in a
+ * tuplesort; it therefore behooves us to request a small tlist to
+ * avoid wasting space.
*/
- subplan = create_plan_recurse(root, best_path->subpath, CP_LABEL_TLIST);
+ if (!best_path->is_sorted)
+ flags = flags | CP_SMALL_TLIST;
+
+ subplan = create_plan_recurse(root, best_path->subpath, flags);
/*
* Compute the mapping from tleSortGroupRef to column index in the child's
@@ -2230,12 +2239,22 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
- if (!rollup->is_hashed && !is_first_sort)
+ if (!rollup->is_hashed)
{
- sort_plan = (Plan *)
- make_sort_from_groupcols(rollup->groupClause,
- new_grpColIdx,
- subplan);
+ if (!is_first_sort ||
+ (is_first_sort && !best_path->is_sorted))
+ {
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ new_grpColIdx,
+ subplan);
+
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
}
if (!rollup->is_hashed)
@@ -2260,16 +2279,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
NIL,
rollup->numGroups,
best_path->transitionSpace,
- sort_plan);
-
- /*
- * Remove stuff we don't need to avoid bloating debug output.
- */
- if (sort_plan)
- {
- sort_plan->targetlist = NIL;
- sort_plan->lefttree = NULL;
- }
+ sort_plan,
+ NULL);
chain = lappend(chain, agg_plan);
}
@@ -2281,10 +2292,26 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
{
RollupData *rollup = linitial(rollups);
AttrNumber *top_grpColIdx;
+ Plan *sort_plan = NULL;
int numGroupCols;
top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
+ /* the input is not sorted yet */
+ if (!rollup->is_hashed &&
+ !best_path->is_sorted)
+ {
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ top_grpColIdx,
+ subplan);
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
+
numGroupCols = list_length((List *) linitial(rollup->gsets));
plan = make_agg(build_path_tlist(root, &best_path->path),
@@ -2299,6 +2326,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
chain,
rollup->numGroups,
best_path->transitionSpace,
+ sort_plan,
subplan);
/* Copy cost data from Path to Plan */
@@ -6197,7 +6225,7 @@ make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain, double dNumGroups,
- Size transitionSpace, Plan *lefttree)
+ Size transitionSpace, Plan *sortnode, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6217,6 +6245,7 @@ make_agg(List *tlist, List *qual,
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
+ node->sortnode = sortnode;
plan->qual = qual;
plan->targetlist = tlist;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b44efd6314..82a15761b4 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -175,7 +175,8 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ AggStrategy strat);
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -4186,6 +4187,14 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
* it, by combinations of hashing and sorting. This can be called multiple
* times, so it's important that it not scribble on input. No result is
* returned, but any generated paths are added to grouped_rel.
+ *
+ * - strat:
+ * preferred aggregate strategy to use.
+ *
+ * - is_sorted:
+ * Whether the input is sorted on the groupCols of the first rollup.
+ * The caller must set it correctly if strat is AGG_SORTED; the
+ * planner uses it to decide whether to generate a sortnode.
*/
static void
consider_groupingsets_paths(PlannerInfo *root,
@@ -4195,13 +4204,15 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ AggStrategy strat)
{
Query *parse = root->parse;
+ Assert(strat == AGG_HASHED || strat == AGG_SORTED);
/*
- * If we're not being offered sorted input, then only consider plans that
- * can be done entirely by hashing.
+ * If strat is AGG_HASHED, then only consider plans that can be done
+ * entirely by hashing.
*
* We can hash everything if it looks like it'll fit in work_mem. But if
* the input is actually sorted despite not being advertised as such, we
@@ -4210,7 +4221,7 @@ consider_groupingsets_paths(PlannerInfo *root,
* If none of the grouping sets are sortable, then ignore the work_mem
* limit and generate a path anyway, since otherwise we'll just fail.
*/
- if (!is_sorted)
+ if (strat == AGG_HASHED)
{
List *new_rollups = NIL;
RollupData *unhashed_rollup = NULL;
@@ -4251,6 +4262,8 @@ consider_groupingsets_paths(PlannerInfo *root,
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
l_start = lnext(gd->rollups, l_start);
+ /* update is_sorted to true */
+ is_sorted = true;
}
hashsize = estimate_hashagg_tablesize(path,
@@ -4348,6 +4361,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->hashable = false;
rollup->is_hashed = false;
new_rollups = lappend(new_rollups, rollup);
+ /* update is_sorted to true */
+ is_sorted = true;
strat = AGG_MIXED;
}
@@ -4359,18 +4374,23 @@ consider_groupingsets_paths(PlannerInfo *root,
strat,
new_rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
return;
}
/*
- * If we have sorted input but nothing we can do with it, bail.
+ * The strategy is AGG_SORTED but there's nothing we can do with it; bail.
*/
if (list_length(gd->rollups) == 0)
return;
/*
- * Given sorted input, we try and make two paths: one sorted and one mixed
+ * The caller requested the AGG_SORTED strategy, so the first rollup
+ * must use a non-hashed aggregate; 'is_sorted' tells us whether the
+ * first rollup needs to do its own sort.
+ *
+ * We try to make two paths: one sorted and one mixed
* sort/hash. (We need to try both because hashagg might be disabled, or
* some columns might not be sortable.)
*
@@ -4427,7 +4447,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* We leave the first rollup out of consideration since it's the
- * one that matches the input sort order. We assign indexes "i"
* one that needs to be sorted. We assign indexes "i"
* to only those entries considered for hashing; the second loop,
* below, must use the same condition.
*/
@@ -4516,7 +4536,8 @@ consider_groupingsets_paths(PlannerInfo *root,
AGG_MIXED,
rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
}
}
@@ -4532,7 +4553,8 @@ consider_groupingsets_paths(PlannerInfo *root,
AGG_SORTED,
gd->rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
}
/*
@@ -6399,6 +6421,16 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
path->pathkeys);
if (path == cheapest_path || is_sorted)
{
+ if (parse->groupingSets)
+ {
+ /* consider AGG_SORTED strategy */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_costs, dNumGroups,
+ AGG_SORTED);
+ continue;
+ }
+
/* Sort the cheapest-total path if it isn't already sorted */
if (!is_sorted)
path = (Path *) create_sort_path(root,
@@ -6407,14 +6439,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
root->group_pathkeys,
-1.0);
- /* Now decide what to stick atop it */
- if (parse->groupingSets)
- {
- consider_groupingsets_paths(root, grouped_rel,
- path, true, can_hash,
- gd, agg_costs, dNumGroups);
- }
- else if (parse->hasAggs)
+ if (parse->hasAggs)
{
/*
* We have aggregation, possibly with plain GROUP BY. Make
@@ -6514,7 +6539,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ AGG_HASHED);
}
else
{
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index d9ce516211..0feb3363d3 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2983,6 +2983,7 @@ create_agg_path(PlannerInfo *root,
* 'rollups' is a list of RollupData nodes
* 'agg_costs' contains cost info about the aggregate functions to be computed
* 'numGroups' is the estimated total number of groups
+ * 'is_sorted' is true if the input is sorted on the group cols of the first rollup
*/
GroupingSetsPath *
create_groupingsets_path(PlannerInfo *root,
@@ -2992,7 +2993,8 @@ create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups)
+ double numGroups,
+ bool is_sorted)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
PathTarget *target = rel->reltarget;
@@ -3010,6 +3012,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->is_sorted = is_sorted;
/*
* Simplify callers by downgrading AGG_SORTED to AGG_PLAIN, and AGG_MIXED
@@ -3061,14 +3064,33 @@ create_groupingsets_path(PlannerInfo *root,
*/
if (is_first)
{
+ Cost input_startup_cost = subpath->startup_cost;
+ Cost input_total_cost = subpath->total_cost;
+
+ if (!rollup->is_hashed && !is_sorted && numGroupCols)
+ {
+ Path sort_path; /* dummy for result of cost_sort */
+
+ cost_sort(&sort_path, root, NIL,
+ input_total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ -1.0);
+
+ input_startup_cost = sort_path.startup_cost;
+ input_total_cost = sort_path.total_cost;
+ }
+
cost_agg(&pathnode->path, root,
aggstrategy,
agg_costs,
numGroupCols,
rollup->numGroups,
having_qual,
- subpath->startup_cost,
- subpath->total_cost,
+ input_startup_cost,
+ input_total_cost,
subpath->rows);
is_first = false;
if (!rollup->is_hashed)
@@ -3079,7 +3101,7 @@ create_groupingsets_path(PlannerInfo *root,
Path sort_path; /* dummy for result of cost_sort */
Path agg_path; /* dummy for result of cost_agg */
- if (rollup->is_hashed || is_first_sort)
+ if (rollup->is_hashed || (is_first_sort && is_sorted))
{
/*
* Account for cost of aggregation, but don't charge input
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 264916f9a9..66a83b9ac9 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -277,8 +277,6 @@ typedef struct AggStatePerPhaseData
ExprState **eqfunctions; /* expression returning equality, indexed by
* nr of cols to compare */
Agg *aggnode; /* Agg node for phase data */
- Sort *sortnode; /* Sort node for input ordering for phase */
-
ExprState *evaltrans; /* evaluation of transition functions */
} AggStatePerPhaseData;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index cd3ddf781f..5e33a368f5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2083,8 +2083,11 @@ typedef struct AggState
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
+ /* these fields are used in AGG_SORTED and AGG_MIXED */
+ bool input_sorted; /* is input sorted on the first rollup's group cols? */
+
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 34
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 35
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..c1e69c808f 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1702,6 +1702,7 @@ typedef struct GroupingSetsPath
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
uint64 transitionSpace; /* for pass-by-ref transition data */
+ bool is_sorted; /* input sorted in groupcols of first rollup */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..3cd2537e9e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -818,6 +818,7 @@ typedef struct Agg
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Plan *sortnode; /* agg does its own sort; currently used only by grouping sets */
} Agg;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..f9f388ba06 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,7 +217,8 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups);
+ double numGroups,
+ bool is_sorted);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
PathTarget *target,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 4781201001..5954ff3997 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain, double dNumGroups,
- Size transitionSpace, Plan *lefttree);
+ Size transitionSpace, Plan *sortnode, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index c1f802c88a..12425f46ca 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -366,15 +366,14 @@ explain (costs off)
select g as alias1, g as alias2
from generate_series(1,3) g
group by alias1, rollup(alias2);
- QUERY PLAN
-------------------------------------------------
+ QUERY PLAN
+------------------------------------------
GroupAggregate
- Group Key: g, g
- Group Key: g
- -> Sort
- Sort Key: g
- -> Function Scan on generate_series g
-(6 rows)
+ Sort Key: g, g
+ Group Key: g, g
+ Group Key: g
+ -> Function Scan on generate_series g
+(5 rows)
select g as alias1, g as alias2
from generate_series(1,3) g
@@ -640,15 +639,14 @@ select a, b, sum(v.x)
-- Test reordering of grouping sets
explain (costs off)
select * from gstest1 group by grouping sets((a,b,v),(v)) order by v,b,a;
- QUERY PLAN
-------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------
GroupAggregate
- Group Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
- Group Key: "*VALUES*".column3
- -> Sort
- Sort Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
- -> Values Scan on "*VALUES*"
-(6 rows)
+ Sort Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
+ Group Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
+ Group Key: "*VALUES*".column3
+ -> Values Scan on "*VALUES*"
+(5 rows)
-- Agg level check. This query should error out.
select (select grouping(a,b) from gstest2) from gstest2 group by a,b;
@@ -723,13 +721,12 @@ explain (costs off)
QUERY PLAN
----------------------------------
GroupAggregate
- Group Key: a
- Group Key: ()
+ Sort Key: a
+ Group Key: a
+ Group Key: ()
Filter: (a IS DISTINCT FROM 1)
- -> Sort
- Sort Key: a
- -> Seq Scan on gstest2
-(7 rows)
+ -> Seq Scan on gstest2
+(6 rows)
select v.c, (select count(*) from gstest2 group by () having v.c)
from (values (false),(true)) v(c) order by v.c;
@@ -1018,18 +1015,17 @@ explain (costs off) select a, b, grouping(a,b), sum(v), count(*), max(v)
explain (costs off)
select a, b, grouping(a,b), array_agg(v order by v)
from gstest1 group by cube(a,b);
- QUERY PLAN
-----------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------
GroupAggregate
- Group Key: "*VALUES*".column1, "*VALUES*".column2
- Group Key: "*VALUES*".column1
- Group Key: ()
+ Sort Key: "*VALUES*".column1, "*VALUES*".column2
+ Group Key: "*VALUES*".column1, "*VALUES*".column2
+ Group Key: "*VALUES*".column1
+ Group Key: ()
Sort Key: "*VALUES*".column2
Group Key: "*VALUES*".column2
- -> Sort
- Sort Key: "*VALUES*".column1, "*VALUES*".column2
- -> Values Scan on "*VALUES*"
-(9 rows)
+ -> Values Scan on "*VALUES*"
+(8 rows)
-- unsortable cases
select unsortable_col, count(*)
@@ -1071,11 +1067,10 @@ explain (costs off)
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
Hash Key: unsortable_col
- Group Key: unhashable_col
- -> Sort
- Sort Key: unhashable_col
- -> Seq Scan on gstest4
-(8 rows)
+ Sort Key: unhashable_col
+ Group Key: unhashable_col
+ -> Seq Scan on gstest4
+(7 rows)
select unhashable_col, unsortable_col,
grouping(unhashable_col, unsortable_col),
@@ -1114,11 +1109,10 @@ explain (costs off)
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
Hash Key: v, unsortable_col
- Group Key: v, unhashable_col
- -> Sort
- Sort Key: v, unhashable_col
- -> Seq Scan on gstest4
-(8 rows)
+ Sort Key: v, unhashable_col
+ Group Key: v, unhashable_col
+ -> Seq Scan on gstest4
+(7 rows)
-- empty input: first is 0 rows, second 1, third 3 etc.
select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),a);
@@ -1366,19 +1360,18 @@ explain (costs off)
BEGIN;
SET LOCAL enable_hashagg = false;
EXPLAIN (COSTS OFF) SELECT a, b, count(*), max(a), max(b) FROM gstest3 GROUP BY GROUPING SETS(a, b,()) ORDER BY a, b;
- QUERY PLAN
----------------------------------------
+ QUERY PLAN
+---------------------------------
Sort
Sort Key: a, b
-> GroupAggregate
- Group Key: a
- Group Key: ()
+ Sort Key: a
+ Group Key: a
+ Group Key: ()
Sort Key: b
Group Key: b
- -> Sort
- Sort Key: a
- -> Seq Scan on gstest3
-(10 rows)
+ -> Seq Scan on gstest3
+(9 rows)
SELECT a, b, count(*), max(a), max(b) FROM gstest3 GROUP BY GROUPING SETS(a, b,()) ORDER BY a, b;
a | b | count | max | max
@@ -1549,22 +1542,21 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+----------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
- Group Key: unique1
+ Sort Key: unique1
+ Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
Sort Key: thousand
Group Key: thousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(13 rows)
+ -> Seq Scan on tenk1
+(12 rows)
explain (costs off)
select unique1,
@@ -1572,18 +1564,17 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+-------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
- Group Key: unique1
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(9 rows)
+ Sort Key: unique1
+ Group Key: unique1
+ -> Seq Scan on tenk1
+(8 rows)
set work_mem = '384kB';
explain (costs off)
@@ -1592,21 +1583,20 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+----------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
Hash Key: thousand
- Group Key: unique1
+ Sort Key: unique1
+ Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(12 rows)
+ -> Seq Scan on tenk1
+(11 rows)
-- check collation-sensitive matching between grouping expressions
-- (similar to a check for aggregates, but there are additional code
--
2.21.1
0002-fixes.patch
From 57f30d60158398bd57f02ef00800d3f63c6ab12f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 01:20:58 +0100
Subject: [PATCH 2/7] fixes
---
src/backend/executor/nodeAgg.c | 3 +--
src/backend/optimizer/plan/createplan.c | 14 ++++++++++----
src/backend/optimizer/plan/planner.c | 23 +++++++++++++----------
src/test/modules/unsafe_tests/Makefile | 2 +-
4 files changed, 25 insertions(+), 17 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b4f53bf77a..c3a043e448 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -368,7 +368,7 @@ initialize_phase(AggState *aggstate, int newphase)
*/
if (newphase > 0 && newphase < aggstate->numphases - 1)
{
- Sort *sortnode = (Sort *)aggstate->phases[newphase + 1].aggnode->sortnode;
+ Sort *sortnode = (Sort *) aggstate->phases[newphase + 1].aggnode->sortnode;
PlanState *outerNode = outerPlanState(aggstate);
TupleDesc tupDesc = ExecGetResultType(outerNode);
@@ -3620,7 +3620,6 @@ ExecReScanAgg(AggState *node)
node->projected_set = -1;
}
-
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index d5b34089aa..044ec92aa8 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2171,10 +2171,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
/*
* Agg can project, so no need to be terribly picky about child tlist, but
- * we do need grouping columns to be available; If the groupingsets need
+ * we do need grouping columns to be available. If the groupingsets need
* to sort the input, the agg will store the input rows in a tuplesort,
- * it therefore behooves us to request a small tlist to avoid wasting
- * spaces.
+ * so we request a small tlist to avoid wasting space.
*/
if (!best_path->is_sorted)
flags = flags | CP_SMALL_TLIST;
@@ -2239,6 +2238,10 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
+ /*
+ * FIXME This combination of nested if checks needs some explanation
+ * why we need this particular combination of flags.
+ */
if (!rollup->is_hashed)
{
if (!is_first_sort ||
@@ -2297,7 +2300,10 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
- /* the input is not sorted yet */
+ /*
+ * When the rollup uses sorted mode, and the input is not already sorted,
+ * add an explicit sort.
+ */
if (!rollup->is_hashed &&
!best_path->is_sorted)
{
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 82a15761b4..fc6a1d0044 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4188,13 +4188,9 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
* times, so it's important that it not scribble on input. No result is
* returned, but any generated paths are added to grouped_rel.
*
- * - strat:
- * preferred aggregate strategy to use.
- *
- * - is_sorted:
- * Is the input sorted on the groupCols of the first rollup. Caller
- * must set it correctly if strat is set to AGG_SORTED, the planner
- * uses it to generate a sortnode.
+ * The caller specifies the preferred aggregate strategy (sorted or hashed) using
+ * the strat parameter. When the requested strategy is AGG_SORTED, the input path
+ * needs to be sorted accordingly (is_sorted needs to be true).
*/
static void
consider_groupingsets_paths(PlannerInfo *root,
@@ -4262,7 +4258,8 @@ consider_groupingsets_paths(PlannerInfo *root,
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
l_start = lnext(gd->rollups, l_start);
- /* update is_sorted to true */
+ /* update is_sorted to true
+ * XXX why? shouldn't it be already set by the caller? */
is_sorted = true;
}
@@ -4361,7 +4358,9 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->hashable = false;
rollup->is_hashed = false;
new_rollups = lappend(new_rollups, rollup);
- /* update is_sorted to true */
+ /* update is_sorted to true
+ * XXX why? shouldn't it be already set by the caller?
+ */
is_sorted = true;
strat = AGG_MIXED;
}
@@ -4396,6 +4395,9 @@ consider_groupingsets_paths(PlannerInfo *root,
*
* can_hash is passed in as false if some obstacle elsewhere (such as
* ordered aggs) means that we shouldn't consider hashing at all.
+ *
+ * XXX This comment seems to be broken by the patch, and it's not very
+ * clear to me what it tries to say.
*/
if (can_hash && gd->any_hashable)
{
@@ -4447,7 +4449,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* We leave the first rollup out of consideration since it's the
- * one that need to be sorted. We assign indexes "i"
+ * one that matches the input sort order. We assign indexes "i"
* to only those entries considered for hashing; the second loop,
* below, must use the same condition.
*/
@@ -6421,6 +6423,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
path->pathkeys);
if (path == cheapest_path || is_sorted)
{
+ /* XXX Why do we do it before possibly adding an explicit sort on top? */
if (parse->groupingSets)
{
/* consider AGG_SORTED strategy */
diff --git a/src/test/modules/unsafe_tests/Makefile b/src/test/modules/unsafe_tests/Makefile
index 3ecf5fcfc5..2cf710eb2c 100644
--- a/src/test/modules/unsafe_tests/Makefile
+++ b/src/test/modules/unsafe_tests/Makefile
@@ -1,6 +1,6 @@
# src/test/modules/unsafe_tests/Makefile
-REGRESS = rolenames alter_system_table
+REGRESS = alter_system_table
ifdef USE_PGXS
PG_CONFIG = pg_config
--
2.21.1
0003-fix-a-numtrans-bug.patch
From d765999cd2446fa05ad9b53a5e87dd5480a9b55c Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Thu, 12 Mar 2020 04:38:36 -0400
Subject: [PATCH 3/7] fix a numtrans bug
aggstate->numtrans is always zero when building the hash table for
hash aggregates, which makes the additional size estimate for the
hash table incorrect.
---
src/backend/executor/nodeAgg.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index c3a043e448..f7af5eebd0 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2570,10 +2570,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
{
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
-
- find_hash_columns(aggstate);
- build_hash_tables(aggstate);
- aggstate->table_filled = false;
}
/*
@@ -2929,6 +2925,14 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
+ /* Initialize hash tables for hash aggregates */
+ if (use_hashing)
+ {
+ find_hash_columns(aggstate);
+ build_hash_tables(aggstate);
+ aggstate->table_filled = false;
+ }
+
/*
* Build expressions doing all the transition work at once. We build a
* different one for each phase, as the number of transition function
--
2.21.1
0004-Reorganise-the-aggregate-phases.patch
From 62ea6872e4da65f03dd93f977b62ab483e8b46f7 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 01:35:48 +0100
Subject: [PATCH 4/7] Reorganise the aggregate phases
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This commit is a preparing step to support parallel grouping sets.
When planning, PG used to organize the grouping sets in [HASHED] -> [SORTED]
order, meaning HASHED aggregates were always located before SORTED aggregates.
When initializing the AGG node, PG likewise organized the aggregate phases in
[HASHED] -> [SORTED] order, with all HASHED grouping sets squeezed into phase 0.
When executing the AGG node with the AGG_SORTED or AGG_MIXED strategy, the
executor would start from phase 1 -> phase 2 -> phase 3, then return to phase 0
for the AGG_MIXED strategy. This is troublesome when adding support for parallel
grouping sets: first, we need complicated logic in many places to locate the
first sorted rollup/phase and to handle the special ordering for each strategy;
second, squeezing all hashed grouping sets into phase 0 does not work for
parallel grouping sets, because we cannot put all hash transition functions
into one expression state in the final stage.
This commit organizes the grouping sets in a more natural order,
[SORTED] -> [HASHED], and the HASHED sets are no longer squeezed into a single
phase; instead we use another way to put all hash transitions into the first
phase's expression state. The executor now starts execution from phase 0 for
all strategies.
This commit also moves 'sort_in' from AggState to the AggStatePerPhase*
structure. This helps handle more complicated cases once parallel grouping
sets are introduced, when we might need to add a tuplestore 'store_in' to
store partial aggregate results for PLAIN sets.
---
src/backend/commands/explain.c | 2 +-
src/backend/executor/execExpr.c | 58 +-
src/backend/executor/execExprInterp.c | 30 +-
src/backend/executor/nodeAgg.c | 718 +++++++++---------
src/backend/jit/llvm/llvmjit_expr.c | 51 +-
src/backend/optimizer/plan/createplan.c | 29 +-
src/backend/optimizer/plan/planner.c | 9 +-
src/backend/optimizer/util/pathnode.c | 71 +-
src/include/executor/execExpr.h | 5 +-
src/include/executor/executor.h | 2 +-
src/include/executor/nodeAgg.h | 26 +-
src/include/nodes/execnodes.h | 18 +-
src/test/regress/expected/groupingsets.out | 38 +-
.../regress/expected/partition_aggregate.out | 2 +-
14 files changed, 529 insertions(+), 530 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b1609b339a..2c63cdb46c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2317,7 +2317,7 @@ show_grouping_set_keys(PlanState *planstate,
const char *keyname;
const char *keysetname;
- if (aggnode->aggstrategy == AGG_HASHED || aggnode->aggstrategy == AGG_MIXED)
+ if (aggnode->aggstrategy == AGG_HASHED)
{
keyname = "Hash Key";
keysetname = "Hash Keys";
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 1370ffec50..07789501f7 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -80,7 +80,7 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash,
+ int transno, int setno, AggStatePerPhase perphase,
bool nullcheck);
@@ -2931,13 +2931,13 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
* the array of AggStatePerGroup, and skip evaluation if so.
*/
ExprState *
-ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash, bool nullcheck)
+ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase, bool nullcheck)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
ExprEvalStep scratch = {0};
bool isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
+ ListCell *lc;
LastAttnumInfo deform = {0, 0, 0};
state->expr = (Expr *) aggstate;
@@ -3155,38 +3155,24 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* grouping set). Do so for both sort and hash based computations, as
* applicable.
*/
- if (doSort)
+ for (int setno = 0; setno < phase->numsets; setno++)
{
- int processGroupingSets = Max(phase->numsets, 1);
- int setoff = 0;
-
- for (int setno = 0; setno < processGroupingSets; setno++)
- {
- ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false,
- nullcheck);
- setoff++;
- }
+ ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
+ pertrans, transno, setno, phase, nullcheck);
}
- if (doHash)
+ /*
+ * Call transition function for HASHED aggs that can be
+ * advanced concurrently.
+ */
+ foreach(lc, phase->concurrent_hashes)
{
- int numHashes = aggstate->num_hashes;
- int setoff;
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) lfirst(lc);
- /* in MIXED mode, there'll be preceding transition values */
- if (aggstate->aggstrategy != AGG_HASHED)
- setoff = aggstate->maxsets;
- else
- setoff = 0;
-
- for (int setno = 0; setno < numHashes; setno++)
- {
- ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true,
- nullcheck);
- setoff++;
- }
+ ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
+ pertrans, transno, 0,
+ (AggStatePerPhase) perhash,
+ nullcheck);
}
/* adjust early bail out jump target(s) */
@@ -3234,14 +3220,17 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash,
+ int transno, int setno, AggStatePerPhase perphase,
bool nullcheck)
{
ExprContext *aggcontext;
int adjust_jumpnull = -1;
- if (ishash)
+ if (perphase->is_hashed)
+ {
+ Assert(setno == 0);
aggcontext = aggstate->hashcontext;
+ }
else
aggcontext = aggstate->aggcontexts[setno];
@@ -3249,9 +3238,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (nullcheck)
{
scratch->opcode = EEOP_AGG_PLAIN_PERGROUP_NULLCHECK;
- scratch->d.agg_plain_pergroup_nullcheck.setoff = setoff;
+ scratch->d.agg_plain_pergroup_nullcheck.pergroups = perphase->pergroups;
/* adjust later */
scratch->d.agg_plain_pergroup_nullcheck.jumpnull = -1;
+ scratch->d.agg_plain_pergroup_nullcheck.setno = setno;
ExprEvalPushStep(state, scratch);
adjust_jumpnull = state->steps_len - 1;
}
@@ -3319,7 +3309,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
scratch->d.agg_trans.pertrans = pertrans;
scratch->d.agg_trans.setno = setno;
- scratch->d.agg_trans.setoff = setoff;
+ scratch->d.agg_trans.pergroups = perphase->pergroups;
scratch->d.agg_trans.transno = transno;
scratch->d.agg_trans.aggcontext = aggcontext;
ExprEvalPushStep(state, scratch);
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 113ed1547c..b0dbba4e55 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -1610,9 +1610,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_CASE(EEOP_AGG_PLAIN_PERGROUP_NULLCHECK)
{
- AggState *aggstate = castNode(AggState, state->parent);
- AggStatePerGroup pergroup_allaggs = aggstate->all_pergroups
- [op->d.agg_plain_pergroup_nullcheck.setoff];
+ AggStatePerGroup pergroup_allaggs =
+ op->d.agg_plain_pergroup_nullcheck.pergroups
+ [op->d.agg_plain_pergroup_nullcheck.setno];
if (pergroup_allaggs == NULL)
EEO_JUMP(op->d.agg_plain_pergroup_nullcheck.jumpnull);
@@ -1636,8 +1636,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1665,8 +1665,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1684,8 +1684,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1702,8 +1702,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
@@ -1724,8 +1724,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
@@ -1742,8 +1742,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f7af5eebd0..20c5eb98b3 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -227,6 +227,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
@@ -263,7 +264,7 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
-static void prepare_hash_slot(AggState *aggstate);
+static void prepare_hash_slot(AggState *aggstate, AggStatePerPhaseHash perhash);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
@@ -274,12 +275,13 @@ static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
-static void lookup_hash_entries(AggState *aggstate);
+static void build_hash_table(AggState *aggstate, AggStatePerPhaseHash perhash);
+static void lookup_hash_entries(AggState *aggstate, List *perhashes);
+static void lookup_hash_entry(AggState *aggstate, AggStatePerPhaseHash perhash,
+ uint32 hash);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
-static void agg_sort_input(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
+static void agg_sort_input(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
@@ -310,7 +312,10 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
* ExecAggPlainTransByRef().
*/
if (is_hash)
+ {
+ Assert(setno == 0);
aggstate->curaggcontext = aggstate->hashcontext;
+ }
else
aggstate->curaggcontext = aggstate->aggcontexts[setno];
@@ -318,72 +323,75 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
}
/*
- * Switch to phase "newphase", which must either be 0 or 1 (to reset) or
+ * Switch to phase "newphase", which must either be 0 (to reset) or
* current_phase + 1. Juggle the tuplesorts accordingly.
- *
- * Phase 0 is for hashing, which we currently handle last in the AGG_MIXED
- * case, so when entering phase 0, all we need to do is drop open sorts.
*/
static void
initialize_phase(AggState *aggstate, int newphase)
{
- Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);
+ AggStatePerPhase current_phase;
+ AggStatePerPhaseSort persort;
+
+ Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
+
+ /* Don't use aggstate->phase here; it might not be initialized yet */
+ current_phase = aggstate->phases[aggstate->current_phase];
/*
* Whatever the previous state, we're now done with whatever input
- * tuplesort was in use.
+ * tuplesort was in use; clean it up.
+ *
+ * Note: we keep the first tuplesort/tuplestore; this benefits rescans
+ * in some cases by avoiding resorting the input again.
*/
- if (aggstate->sort_in)
- {
- tuplesort_end(aggstate->sort_in);
- aggstate->sort_in = NULL;
- }
-
- if (newphase <= 1)
+ if (!current_phase->is_hashed && aggstate->current_phase > 0)
{
- /*
- * Discard any existing output tuplesort.
- */
- if (aggstate->sort_out)
+ persort = (AggStatePerPhaseSort) current_phase;
+ if (persort->sort_in)
{
- tuplesort_end(aggstate->sort_out);
- aggstate->sort_out = NULL;
+ tuplesort_end(persort->sort_in);
+ persort->sort_in = NULL;
}
}
- else
- {
- /*
- * The old output tuplesort becomes the new input one, and this is the
- * right time to actually sort it.
- */
- aggstate->sort_in = aggstate->sort_out;
- aggstate->sort_out = NULL;
- Assert(aggstate->sort_in);
- tuplesort_performsort(aggstate->sort_in);
- }
+
+ /* advance to next phase */
+ aggstate->current_phase = newphase;
+ aggstate->phase = aggstate->phases[newphase];
+
+ if (aggstate->phase->is_hashed)
+ return;
+
+ /* New phase is not hashed */
+ persort = (AggStatePerPhaseSort) aggstate->phase;
+
+ /* This is the right time to actually sort it. */
+ if (persort->sort_in)
+ tuplesort_performsort(persort->sort_in);
/*
- * If this isn't the last phase, we need to sort appropriately for the
+ * If copy_out is set, we need to sort appropriately for the
* next phase in sequence.
*/
- if (newphase > 0 && newphase < aggstate->numphases - 1)
- {
- Sort *sortnode = (Sort *) aggstate->phases[newphase + 1].aggnode->sortnode;
- PlanState *outerNode = outerPlanState(aggstate);
- TupleDesc tupDesc = ExecGetResultType(outerNode);
-
- aggstate->sort_out = tuplesort_begin_heap(tupDesc,
- sortnode->numCols,
- sortnode->sortColIdx,
- sortnode->sortOperators,
- sortnode->collations,
- sortnode->nullsFirst,
- work_mem,
- NULL, false);
+ if (persort->copy_out)
+ {
+ AggStatePerPhaseSort next =
+ (AggStatePerPhaseSort) aggstate->phases[newphase + 1];
+ Sort *sortnode = (Sort *) next->phasedata.aggnode->sortnode;
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+
+ Assert(!next->phasedata.is_hashed);
+
+ if (!next->sort_in)
+ next->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
}
-
- aggstate->current_phase = newphase;
- aggstate->phase = &aggstate->phases[newphase];
}
/*
@@ -398,12 +406,16 @@ static TupleTableSlot *
fetch_input_tuple(AggState *aggstate)
{
TupleTableSlot *slot;
+ AggStatePerPhaseSort current_phase;
+
+ Assert(!aggstate->phase->is_hashed);
+ current_phase = (AggStatePerPhaseSort) aggstate->phase;
- if (aggstate->sort_in)
+ if (current_phase->sort_in)
{
/* make sure we check for interrupts in either path through here */
CHECK_FOR_INTERRUPTS();
- if (!tuplesort_gettupleslot(aggstate->sort_in, true, false,
+ if (!tuplesort_gettupleslot(current_phase->sort_in, true, false,
aggstate->sort_slot, NULL))
return NULL;
slot = aggstate->sort_slot;
@@ -411,8 +423,13 @@ fetch_input_tuple(AggState *aggstate)
else
slot = ExecProcNode(outerPlanState(aggstate));
- if (!TupIsNull(slot) && aggstate->sort_out)
- tuplesort_puttupleslot(aggstate->sort_out, slot);
+ if (!TupIsNull(slot) && current_phase->copy_out)
+ {
+ AggStatePerPhaseSort next =
+ (AggStatePerPhaseSort) aggstate->phases[aggstate->current_phase + 1];
+ Assert(!next->phasedata.is_hashed);
+ tuplesort_puttupleslot(next->sort_in, slot);
+ }
return slot;
}
@@ -518,7 +535,7 @@ initialize_aggregates(AggState *aggstate,
int numReset)
{
int transno;
- int numGroupingSets = Max(aggstate->phase->numsets, 1);
+ int numGroupingSets = aggstate->phase->numsets;
int setno = 0;
int numTrans = aggstate->numtrans;
AggStatePerTrans transstates = aggstate->pertrans;
@@ -1046,10 +1063,9 @@ finalize_partialaggregate(AggState *aggstate,
* hashslot. This is necessary to compute the hash or perform a lookup.
*/
static void
-prepare_hash_slot(AggState *aggstate)
+prepare_hash_slot(AggState *aggstate, AggStatePerPhaseHash perhash)
{
TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -1283,18 +1299,22 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
static void
build_hash_tables(AggState *aggstate)
{
- int setno;
+ int phaseidx;
- for (setno = 0; setno < aggstate->num_hashes; ++setno)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
+ AggStatePerPhase phase;
+ AggStatePerPhaseHash perhash;
- Assert(perhash->aggnode->numGroups > 0);
+ phase = aggstate->phases[phaseidx];
+ if (!phase->is_hashed)
+ continue;
+ perhash = (AggStatePerPhaseHash) phase;
if (perhash->hashtable)
ResetTupleHashTable(perhash->hashtable);
else
- build_hash_table(aggstate, setno, perhash->aggnode->numGroups);
+ build_hash_table(aggstate, perhash);
}
}
@@ -1302,9 +1322,8 @@ build_hash_tables(AggState *aggstate)
* Build a single hashtable for this grouping set.
*/
static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, AggStatePerPhaseHash perhash)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
MemoryContext metacxt = aggstate->ss.ps.state->es_query_cxt;
MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory;
MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
@@ -1328,8 +1347,8 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
perhash->hashGrpColIdxHash,
perhash->eqfuncoids,
perhash->hashfunctions,
- perhash->aggnode->grpCollations,
- nbuckets,
+ perhash->phasedata.aggnode->grpCollations,
+ perhash->phasedata.aggnode->numGroups,
additionalsize,
metacxt,
hashcxt,
@@ -1367,23 +1386,29 @@ find_hash_columns(AggState *aggstate)
{
Bitmapset *base_colnos;
List *outerTlist = outerPlanState(aggstate)->plan->targetlist;
- int numHashes = aggstate->num_hashes;
EState *estate = aggstate->ss.ps.state;
int j;
/* Find Vars that will be needed in tlist and qual */
base_colnos = find_unaggregated_cols(aggstate);
- for (j = 0; j < numHashes; ++j)
+ for (j = 0; j < aggstate->numphases; ++j)
{
- AggStatePerHash perhash = &aggstate->perhash[j];
+ AggStatePerPhase perphase = aggstate->phases[j];
+ AggStatePerPhaseHash perhash;
Bitmapset *colnos = bms_copy(base_colnos);
- AttrNumber *grpColIdx = perhash->aggnode->grpColIdx;
+ Bitmapset *grouped_cols = perphase->grouped_cols[0];
+ AttrNumber *grpColIdx = perphase->aggnode->grpColIdx;
List *hashTlist = NIL;
+ ListCell *lc;
TupleDesc hashDesc;
int maxCols;
int i;
+ if (!perphase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) perphase;
perhash->largestGrpColIdx = 0;
/*
@@ -1393,18 +1418,12 @@ find_hash_columns(AggState *aggstate)
* there'd be no point storing them. Use prepare_projection_slot's
* logic to determine which.
*/
- if (aggstate->phases[0].grouped_cols)
+ foreach(lc, aggstate->all_grouped_cols)
{
- Bitmapset *grouped_cols = aggstate->phases[0].grouped_cols[j];
- ListCell *lc;
-
- foreach(lc, aggstate->all_grouped_cols)
- {
- int attnum = lfirst_int(lc);
+ int attnum = lfirst_int(lc);
- if (!bms_is_member(attnum, grouped_cols))
- colnos = bms_del_member(colnos, attnum);
- }
+ if (!bms_is_member(attnum, grouped_cols))
+ colnos = bms_del_member(colnos, attnum);
}
/*
@@ -1460,7 +1479,7 @@ find_hash_columns(AggState *aggstate)
hashDesc = ExecTypeFromTL(hashTlist);
execTuplesHashPrepare(perhash->numCols,
- perhash->aggnode->grpOperators,
+ perphase->aggnode->grpOperators,
&perhash->eqfuncoids,
&perhash->hashfunctions);
perhash->hashslot =
@@ -1497,10 +1516,9 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
* When called, CurrentMemoryContext should be the per-query context. The
* already-calculated hash value for the tuple must be specified.
*/
-static AggStatePerGroup
-lookup_hash_entry(AggState *aggstate, uint32 hash)
+static void
+lookup_hash_entry(AggState *aggstate, AggStatePerPhaseHash perhash, uint32 hash)
{
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
bool isnew;
@@ -1532,7 +1550,7 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
}
}
- return entry->additional;
+ perhash->phasedata.pergroups[0] = entry->additional;
}
/*
@@ -1542,21 +1560,19 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
* Be aware that lookup_hash_entry can reset the tmpcontext.
*/
static void
-lookup_hash_entries(AggState *aggstate)
+lookup_hash_entries(AggState *aggstate, List *perhashes)
{
- int numHashes = aggstate->num_hashes;
- AggStatePerGroup *pergroup = aggstate->hash_pergroup;
- int setno;
+ ListCell *lc;
- for (setno = 0; setno < numHashes; setno++)
+ foreach (lc, perhashes)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
uint32 hash;
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) lfirst(lc);
- select_current_set(aggstate, setno, true);
- prepare_hash_slot(aggstate);
+ select_current_set(aggstate, 0, true);
+ prepare_hash_slot(aggstate, perhash);
hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
- pergroup[setno] = lookup_hash_entry(aggstate, hash);
+ lookup_hash_entry(aggstate, perhash, hash);
}
}
@@ -1589,12 +1605,11 @@ ExecAgg(PlanState *pstate)
case AGG_HASHED:
if (!node->table_filled)
agg_fill_hash_table(node);
- /* FALLTHROUGH */
- case AGG_MIXED:
result = agg_retrieve_hash_table(node);
break;
case AGG_PLAIN:
case AGG_SORTED:
+ case AGG_MIXED:
if (!node->input_sorted)
agg_sort_input(node);
result = agg_retrieve_direct(node);
@@ -1622,8 +1637,8 @@ agg_retrieve_direct(AggState *aggstate)
TupleTableSlot *outerslot;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- bool hasGroupingSets = aggstate->phase->numsets > 0;
- int numGroupingSets = Max(aggstate->phase->numsets, 1);
+ bool hasGroupingSets = aggstate->phase->aggnode->groupingSets != NULL;
+ int numGroupingSets = aggstate->phase->numsets;
int currentSet;
int nextSetSize;
int numReset;
@@ -1640,7 +1655,7 @@ agg_retrieve_direct(AggState *aggstate)
tmpcontext = aggstate->tmpcontext;
peragg = aggstate->peragg;
- pergroups = aggstate->pergroups;
+ pergroups = aggstate->phase->pergroups;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
/*
@@ -1698,25 +1713,32 @@ agg_retrieve_direct(AggState *aggstate)
{
if (aggstate->current_phase < aggstate->numphases - 1)
{
+ /* Advance to the next phase */
initialize_phase(aggstate, aggstate->current_phase + 1);
- aggstate->input_done = false;
- aggstate->projected_set = -1;
- numGroupingSets = Max(aggstate->phase->numsets, 1);
- node = aggstate->phase->aggnode;
- numReset = numGroupingSets;
- }
- else if (aggstate->aggstrategy == AGG_MIXED)
- {
- /*
- * Mixed mode; we've output all the grouped stuff and have
- * full hashtables, so switch to outputting those.
- */
- initialize_phase(aggstate, 0);
- aggstate->table_filled = true;
- ResetTupleHashIterator(aggstate->perhash[0].hashtable,
- &aggstate->perhash[0].hashiter);
- select_current_set(aggstate, 0, true);
- return agg_retrieve_hash_table(aggstate);
+
+ /* Check whether the new phase is a hashed phase */
+ if (!aggstate->phase->is_hashed)
+ {
+ aggstate->input_done = false;
+ aggstate->projected_set = -1;
+ numGroupingSets = aggstate->phase->numsets;
+ node = aggstate->phase->aggnode;
+ numReset = numGroupingSets;
+ pergroups = aggstate->phase->pergroups;
+ }
+ else
+ {
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) aggstate->phase;
+
+ /*
+ * Mixed mode; we've output all the grouped stuff and have
+ * full hashtables, so switch to outputting those.
+ */
+ aggstate->table_filled = true;
+ ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
+ select_current_set(aggstate, 0, true);
+ return agg_retrieve_hash_table(aggstate);
+ }
}
else
{
@@ -1755,11 +1777,11 @@ agg_retrieve_direct(AggState *aggstate)
*/
tmpcontext->ecxt_innertuple = econtext->ecxt_outertuple;
if (aggstate->input_done ||
- (node->aggstrategy != AGG_PLAIN &&
+ (aggstate->phase->aggnode->numCols > 0 &&
aggstate->projected_set != -1 &&
aggstate->projected_set < (numGroupingSets - 1) &&
nextSetSize > 0 &&
- !ExecQualAndReset(aggstate->phase->eqfunctions[nextSetSize - 1],
+ !ExecQualAndReset(((AggStatePerPhaseSort) aggstate->phase)->eqfunctions[nextSetSize - 1],
tmpcontext)))
{
aggstate->projected_set += 1;
@@ -1862,13 +1884,13 @@ agg_retrieve_direct(AggState *aggstate)
for (;;)
{
/*
- * During phase 1 only of a mixed agg, we need to update
- * hashtables as well in advance_aggregates.
+ * If the current phase has concurrent hash phases, we need to
+ * update their hashtables as well in advance_aggregates.
*/
- if (aggstate->aggstrategy == AGG_MIXED &&
- aggstate->current_phase == 1)
+ if (aggstate->phase->concurrent_hashes)
{
- lookup_hash_entries(aggstate);
+ lookup_hash_entries(aggstate,
+ aggstate->phase->concurrent_hashes);
}
/* Advance the aggregates (or combine functions) */
@@ -1899,10 +1921,10 @@ agg_retrieve_direct(AggState *aggstate)
* If we are grouping, check whether we've crossed a group
* boundary.
*/
- if (node->aggstrategy != AGG_PLAIN)
+ if (aggstate->phase->aggnode->numCols > 0)
{
tmpcontext->ecxt_innertuple = firstSlot;
- if (!ExecQual(aggstate->phase->eqfunctions[node->numCols - 1],
+ if (!ExecQual(((AggStatePerPhaseSort) aggstate->phase)->eqfunctions[node->numCols - 1],
tmpcontext))
{
aggstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
@@ -1951,24 +1973,31 @@ agg_retrieve_direct(AggState *aggstate)
static void
agg_sort_input(AggState *aggstate)
{
- AggStatePerPhase phase = &aggstate->phases[1];
+ AggStatePerPhase phase = aggstate->phases[0];
+ AggStatePerPhaseSort persort = (AggStatePerPhaseSort) phase;
TupleDesc tupDesc;
Sort *sortnode;
+ bool randomAccess;
Assert(!aggstate->input_sorted);
+ Assert(!phase->is_hashed);
Assert(phase->aggnode->sortnode);
sortnode = (Sort *) phase->aggnode->sortnode;
tupDesc = ExecGetResultType(outerPlanState(aggstate));
-
- aggstate->sort_in = tuplesort_begin_heap(tupDesc,
- sortnode->numCols,
- sortnode->sortColIdx,
- sortnode->sortOperators,
- sortnode->collations,
- sortnode->nullsFirst,
- work_mem,
- NULL, false);
+ randomAccess = (aggstate->eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) != 0;
+
+ persort->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, randomAccess);
for (;;)
{
TupleTableSlot *outerslot;
@@ -1977,11 +2006,11 @@ agg_sort_input(AggState *aggstate)
if (TupIsNull(outerslot))
break;
- tuplesort_puttupleslot(aggstate->sort_in, outerslot);
+ tuplesort_puttupleslot(persort->sort_in, outerslot);
}
/* Sort the first phase */
- tuplesort_performsort(aggstate->sort_in);
+ tuplesort_performsort(persort->sort_in);
/* Mark the input to be sorted */
aggstate->input_sorted = true;
@@ -1993,8 +2022,14 @@ agg_sort_input(AggState *aggstate)
static void
agg_fill_hash_table(AggState *aggstate)
{
+ AggStatePerPhaseHash currentphase;
TupleTableSlot *outerslot;
ExprContext *tmpcontext = aggstate->tmpcontext;
+ List *concurrent_hashes = aggstate->phase->concurrent_hashes;
+
+ /* Current phase must be the first phase */
+ Assert(aggstate->current_phase == 0);
+ currentphase = (AggStatePerPhaseHash) aggstate->phase;
/*
* Process each outer-plan tuple, and then fetch the next one, until we
@@ -2002,15 +2037,25 @@ agg_fill_hash_table(AggState *aggstate)
*/
for (;;)
{
- outerslot = fetch_input_tuple(aggstate);
+ uint32 hash;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
if (TupIsNull(outerslot))
break;
/* set up for lookup_hash_entries and advance_aggregates */
tmpcontext->ecxt_outertuple = outerslot;
- /* Find or build hashtable entries */
- lookup_hash_entries(aggstate);
+ /* Find hashtable entry of current phase */
+ select_current_set(aggstate, 0, true);
+ prepare_hash_slot(aggstate, currentphase);
+ hash = TupleHashTableHash(currentphase->hashtable, currentphase->hashslot);
+ lookup_hash_entry(aggstate, currentphase, hash);
+
+ /* Find or build hashtable entries of concurrent hash phases */
+ if (concurrent_hashes)
+ lookup_hash_entries(aggstate, concurrent_hashes);
/* Advance the aggregates (or combine functions) */
advance_aggregates(aggstate);
@@ -2025,8 +2070,7 @@ agg_fill_hash_table(AggState *aggstate)
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
- ResetTupleHashIterator(aggstate->perhash[0].hashtable,
- &aggstate->perhash[0].hashiter);
+ ResetTupleHashIterator(currentphase->hashtable, &currentphase->hashiter);
}
/*
@@ -2041,7 +2085,7 @@ agg_retrieve_hash_table(AggState *aggstate)
TupleHashEntryData *entry;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- AggStatePerHash perhash;
+ AggStatePerPhaseHash perhash;
/*
* get state info from node.
@@ -2052,11 +2096,7 @@ agg_retrieve_hash_table(AggState *aggstate)
peragg = aggstate->peragg;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
- /*
- * Note that perhash (and therefore anything accessed through it) can
- * change inside the loop, as we change between grouping sets.
- */
- perhash = &aggstate->perhash[aggstate->current_set];
+ perhash = (AggStatePerPhaseHash) aggstate->phase;
/*
* We loop retrieving groups until we find one satisfying
@@ -2075,18 +2115,15 @@ agg_retrieve_hash_table(AggState *aggstate)
entry = ScanTupleHashTable(perhash->hashtable, &perhash->hashiter);
if (entry == NULL)
{
- int nextset = aggstate->current_set + 1;
-
- if (nextset < aggstate->num_hashes)
+ if (aggstate->current_phase + 1 < aggstate->numphases)
{
/*
* Switch to next grouping set, reinitialize, and restart the
* loop.
*/
- select_current_set(aggstate, nextset, true);
-
- perhash = &aggstate->perhash[aggstate->current_set];
-
+ select_current_set(aggstate, 0, true);
+ initialize_phase(aggstate, aggstate->current_phase + 1);
+ perhash = (AggStatePerPhaseHash) aggstate->phase;
ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
continue;
@@ -2165,23 +2202,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
AggState *aggstate;
AggStatePerAgg peraggs;
AggStatePerTrans pertransstates;
- AggStatePerGroup *pergroups;
Plan *outerPlan;
ExprContext *econtext;
TupleDesc scanDesc;
- Agg *firstSortAgg;
int numaggs,
transno,
aggno;
- int phase;
int phaseidx;
ListCell *l;
Bitmapset *all_grouped_cols = NULL;
int numGroupingSets = 1;
- int numPhases;
- int numHashes;
int i = 0;
int j = 0;
+ bool need_extra_slot = false;
bool use_hashing = (node->aggstrategy == AGG_HASHED ||
node->aggstrategy == AGG_MIXED);
@@ -2210,24 +2243,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->curpertrans = NULL;
aggstate->input_done = false;
aggstate->agg_done = false;
- aggstate->pergroups = NULL;
aggstate->grp_firstTuple = NULL;
- aggstate->sort_in = NULL;
- aggstate->sort_out = NULL;
aggstate->input_sorted = true;
-
- /*
- * phases[0] always exists, but is dummy in sorted/plain mode
- */
- numPhases = (use_hashing ? 1 : 2);
- numHashes = (use_hashing ? 1 : 0);
-
- firstSortAgg = node->aggstrategy == AGG_SORTED ? node : NULL;
+ aggstate->eflags = eflags;
/*
* Calculate the maximum number of grouping sets in any phase; this
- * determines the size of some allocations. Also calculate the number of
- * phases, since all hashed/mixed nodes contribute to only a single phase.
+ * determines the size of some allocations.
*/
if (node->groupingSets)
{
@@ -2240,31 +2262,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numGroupingSets = Max(numGroupingSets,
list_length(agg->groupingSets));
- /*
- * additional AGG_HASHED aggs become part of phase 0, but all
- * others add an extra phase.
- */
if (agg->aggstrategy != AGG_HASHED)
- {
- ++numPhases;
-
- if (!firstSortAgg)
- firstSortAgg = agg;
-
- }
- else
- ++numHashes;
+ need_extra_slot = true;
}
}
aggstate->maxsets = numGroupingSets;
- aggstate->numphases = numPhases;
-
+ aggstate->numphases = 1 + list_length(node->chain);
+
/*
- * The first SORTED phase is not sorted, agg need to do its own sort. See
+ * The first phase is not sorted; the agg needs to do its own sort. See
* agg_sort_input(), this can only happen in groupingsets case.
*/
- if (firstSortAgg && firstSortAgg->sortnode)
+ if (node->sortnode)
aggstate->input_sorted = false;
aggstate->aggcontexts = (ExprContext **)
@@ -2325,11 +2335,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
scanDesc = aggstate->ss.ss_ScanTupleSlot->tts_tupleDescriptor;
/*
- * If there are more than two phases (including a potential dummy phase
- * 0), input will be resorted using tuplesort. Need a slot for that.
+ * An extra slot is needed if 1) the agg needs to do its own sort, or
+ * 2) the agg has more than one non-hashed phase.
*/
- if (numPhases > 2 ||
- !aggstate->input_sorted)
+ if (node->sortnode || need_extra_slot)
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -2381,76 +2390,94 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numaggs = aggstate->numaggs;
Assert(numaggs == list_length(aggstate->aggs));
- /*
+ /*
* For each phase, prepare grouping set data and fmgr lookup data for
* compare functions. Accumulate all_grouped_cols in passing.
*/
- aggstate->phases = palloc0(numPhases * sizeof(AggStatePerPhaseData));
-
- aggstate->num_hashes = numHashes;
- if (numHashes)
- {
- aggstate->perhash = palloc0(sizeof(AggStatePerHashData) * numHashes);
- aggstate->phases[0].numsets = 0;
- aggstate->phases[0].gset_lengths = palloc(numHashes * sizeof(int));
- aggstate->phases[0].grouped_cols = palloc(numHashes * sizeof(Bitmapset *));
- }
+ aggstate->phases = palloc0(aggstate->numphases * sizeof(AggStatePerPhase));
- phase = 0;
- for (phaseidx = 0; phaseidx <= list_length(node->chain); ++phaseidx)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
Agg *aggnode;
+ AggStatePerPhase phasedata = NULL;
if (phaseidx > 0)
aggnode = list_nth_node(Agg, node->chain, phaseidx - 1);
else
aggnode = node;
- if (aggnode->aggstrategy == AGG_HASHED
- || aggnode->aggstrategy == AGG_MIXED)
+ if (aggnode->aggstrategy == AGG_HASHED)
{
- AggStatePerPhase phasedata = &aggstate->phases[0];
- AggStatePerHash perhash;
- Bitmapset *cols = NULL;
-
- Assert(phase == 0);
- i = phasedata->numsets++;
- perhash = &aggstate->perhash[i];
+ AggStatePerPhaseHash perhash;
+ Bitmapset *cols = NULL;
- /* phase 0 always points to the "real" Agg in the hash case */
- phasedata->aggnode = node;
- phasedata->aggstrategy = node->aggstrategy;
+ perhash = (AggStatePerPhaseHash) palloc0(sizeof(AggStatePerPhaseHashData));
+ phasedata = (AggStatePerPhase) perhash;
+ phasedata->is_hashed = true;
+ phasedata->aggnode = aggnode;
+ phasedata->aggstrategy = aggnode->aggstrategy;
- /* but the actual Agg node representing this hash is saved here */
- perhash->aggnode = aggnode;
+ /* AGG_HASHED always has only one set */
+ phasedata->numsets = 1;
- phasedata->gset_lengths[i] = perhash->numCols = aggnode->numCols;
+ phasedata->gset_lengths = palloc(sizeof(int));
+ phasedata->gset_lengths[0] = perhash->numCols = aggnode->numCols;
+ phasedata->grouped_cols = palloc(sizeof(Bitmapset *));
for (j = 0; j < aggnode->numCols; ++j)
cols = bms_add_member(cols, aggnode->grpColIdx[j]);
-
- phasedata->grouped_cols[i] = cols;
+ phasedata->grouped_cols[0] = cols;
all_grouped_cols = bms_add_members(all_grouped_cols, cols);
- continue;
+
+ /*
+ * Initialize the pergroup state. For AGG_HASHED, all groups do their
+ * transitions on the fly and all pergroup states are kept in the
+ * hashtable. Every time a tuple is processed, lookup_hash_entry()
+ * chooses one group and sets phasedata->pergroups[0]; then
+ * advance_aggregates can use it to do the transition for that group.
+ * We do not need to allocate a real pergroup and set the pointer
+ * here; there are too many pergroup states, so lookup_hash_entry()
+ * allocates them on demand.
+ */
+ phasedata->pergroups =
+ (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup));
+
+ /*
+ * Hash aggregation does not depend on the order of the input
+ * tuples, so we can do the transition immediately when a tuple
+ * is fetched; that means the transition can run concurrently
+ * with the first phase.
+ */
+ if (phaseidx > 0)
+ {
+ aggstate->phases[0]->concurrent_hashes =
+ lappend(aggstate->phases[0]->concurrent_hashes, perhash);
+ /* the current phase doesn't need to build transition functions */
+ phasedata->skip_build_trans = true;
+ }
}
else
{
- AggStatePerPhase phasedata = &aggstate->phases[++phase];
- int num_sets;
+ AggStatePerPhaseSort persort;
- phasedata->numsets = num_sets = list_length(aggnode->groupingSets);
+ persort = (AggStatePerPhaseSort) palloc0(sizeof(AggStatePerPhaseSortData));
+ phasedata = (AggStatePerPhase) persort;
+ phasedata->is_hashed = false;
+ phasedata->aggnode = aggnode;
+ phasedata->aggstrategy = aggnode->aggstrategy;
- if (num_sets)
+ if (aggnode->groupingSets)
{
- phasedata->gset_lengths = palloc(num_sets * sizeof(int));
- phasedata->grouped_cols = palloc(num_sets * sizeof(Bitmapset *));
+ phasedata->numsets = list_length(aggnode->groupingSets);
+ phasedata->gset_lengths = palloc(phasedata->numsets * sizeof(int));
+ phasedata->grouped_cols = palloc(phasedata->numsets * sizeof(Bitmapset *));
i = 0;
foreach(l, aggnode->groupingSets)
{
- int current_length = list_length(lfirst(l));
- Bitmapset *cols = NULL;
+ int current_length = list_length(lfirst(l));
+ Bitmapset *cols = NULL;
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -2467,37 +2494,49 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
else
{
- Assert(phaseidx == 0);
-
+ phasedata->numsets = 1;
phasedata->gset_lengths = NULL;
phasedata->grouped_cols = NULL;
}
+ /*
+ * Initialize pergroup states for AGG_SORTED/AGG_PLAIN/AGG_MIXED
+ * phases. Each set has only one group on the fly, so all groups
+ * in a set can reuse one pergroup state. Unlike AGG_HASHED, we
+ * pre-allocate the pergroup states here.
+ */
+ phasedata->pergroups =
+ (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup) * phasedata->numsets);
+
+ for (i = 0; i < phasedata->numsets; i++)
+ {
+ phasedata->pergroups[i] =
+ (AggStatePerGroup) palloc0(sizeof(AggStatePerGroupData) * numaggs);
+ }
+
/*
* If we are grouping, precompute fmgr lookup data for inner loop.
*/
- if (aggnode->aggstrategy == AGG_SORTED)
+ if (aggnode->numCols > 0)
{
int i = 0;
- Assert(aggnode->numCols > 0);
-
/*
* Build a separate function for each subset of columns that
* need to be compared.
*/
- phasedata->eqfunctions =
+ persort->eqfunctions =
(ExprState **) palloc0(aggnode->numCols * sizeof(ExprState *));
/* for each grouping set */
- for (i = 0; i < phasedata->numsets; i++)
+ for (i = 0; i < phasedata->numsets && phasedata->gset_lengths; i++)
{
int length = phasedata->gset_lengths[i];
- if (phasedata->eqfunctions[length - 1] != NULL)
+ if (persort->eqfunctions[length - 1] != NULL)
continue;
- phasedata->eqfunctions[length - 1] =
+ persort->eqfunctions[length - 1] =
execTuplesMatchPrepare(scanDesc,
length,
aggnode->grpColIdx,
@@ -2507,9 +2546,9 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
/* and for all grouped columns, unless already computed */
- if (phasedata->eqfunctions[aggnode->numCols - 1] == NULL)
+ if (persort->eqfunctions[aggnode->numCols - 1] == NULL)
{
- phasedata->eqfunctions[aggnode->numCols - 1] =
+ persort->eqfunctions[aggnode->numCols - 1] =
execTuplesMatchPrepare(scanDesc,
aggnode->numCols,
aggnode->grpColIdx,
@@ -2519,9 +2558,23 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
}
- phasedata->aggnode = aggnode;
- phasedata->aggstrategy = aggnode->aggstrategy;
+ /*
+ * A non-first AGG_SORTED phase processes the same input tuples
+ * as the previous phase, except that it needs to resort them.
+ * Tell the previous phase to copy out the tuples.
+ */
+ if (phaseidx > 0)
+ {
+ AggStatePerPhaseSort prev =
+ (AggStatePerPhaseSort) aggstate->phases[phaseidx - 1];
+
+ Assert(!prev->phasedata.is_hashed);
+ /* Tell the previous phase to copy the tuple to the sort_in */
+ prev->copy_out = true;
+ }
}
+
+ aggstate->phases[phaseidx] = phasedata;
}
/*
@@ -2545,51 +2598,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->peragg = peraggs;
aggstate->pertrans = pertransstates;
-
- aggstate->all_pergroups =
- (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup)
- * (numGroupingSets + numHashes));
- pergroups = aggstate->all_pergroups;
-
- if (node->aggstrategy != AGG_HASHED)
- {
- for (i = 0; i < numGroupingSets; i++)
- {
- pergroups[i] = (AggStatePerGroup) palloc0(sizeof(AggStatePerGroupData)
- * numaggs);
- }
-
- aggstate->pergroups = pergroups;
- pergroups += numGroupingSets;
- }
-
/*
- * Hashing can only appear in the initial phase.
+ * Initialize current phase-dependent values to initial phase.
*/
- if (use_hashing)
- {
- /* this is an array of pointers, not structures */
- aggstate->hash_pergroup = pergroups;
- }
-
- /*
- * Initialize current phase-dependent values to initial phase. The initial
- * phase is 1 (first sort pass) for all strategies that use sorting (if
- * hashing is being done too, then phase 0 is processed last); but if only
- * hashing is being done, then phase 0 is all there is.
- */
- if (node->aggstrategy == AGG_HASHED)
- {
- aggstate->current_phase = 0;
- initialize_phase(aggstate, 0);
- select_current_set(aggstate, 0, true);
- }
- else
- {
- aggstate->current_phase = 1;
- initialize_phase(aggstate, 1);
- select_current_set(aggstate, 0, false);
- }
+ aggstate->current_phase = 0;
+ initialize_phase(aggstate, 0);
+ select_current_set(aggstate, 0, aggstate->aggstrategy == AGG_HASHED);
/* -----------------
* Perform lookups of aggregate function info, and initialize the
@@ -2942,49 +2956,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- AggStatePerPhase phase = &aggstate->phases[phaseidx];
- bool dohash = false;
- bool dosort = false;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
- /* phase 0 doesn't necessarily exist */
- if (!phase->aggnode)
- continue;
-
- if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 1)
- {
- /*
- * Phase one, and only phase one, in a mixed agg performs both
- * sorting and aggregation.
- */
- dohash = true;
- dosort = true;
- }
- else if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 0)
- {
- /*
- * No need to compute a transition function for an AGG_MIXED phase
- * 0 - the contents of the hashtables will have been computed
- * during phase 1.
- */
+ if (phase->skip_build_trans)
continue;
- }
- else if (phase->aggstrategy == AGG_PLAIN ||
- phase->aggstrategy == AGG_SORTED)
- {
- dohash = false;
- dosort = true;
- }
- else if (phase->aggstrategy == AGG_HASHED)
- {
- dohash = true;
- dosort = false;
- }
- else
- Assert(false);
-
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
- false);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, false);
}
return aggstate;
@@ -3470,13 +3447,21 @@ ExecEndAgg(AggState *node)
int transno;
int numGroupingSets = Max(node->maxsets, 1);
int setno;
+ int phaseidx;
/* Make sure we have closed any open tuplesorts */
+ for (phaseidx = 0; phaseidx < node->numphases; phaseidx++)
+ {
+ AggStatePerPhase phase = node->phases[phaseidx];
+ AggStatePerPhaseSort persort;
- if (node->sort_in)
- tuplesort_end(node->sort_in);
- if (node->sort_out)
- tuplesort_end(node->sort_out);
+ if (phase->is_hashed)
+ continue;
+
+ persort = (AggStatePerPhaseSort) phase;
+ if (persort->sort_in)
+ tuplesort_end(persort->sort_in);
+ }
for (transno = 0; transno < node->numtrans; transno++)
{
@@ -3518,6 +3503,7 @@ ExecReScanAgg(AggState *node)
int transno;
int numGroupingSets = Max(node->maxsets, 1);
int setno;
+ int phaseidx;
node->agg_done = false;
@@ -3541,8 +3527,12 @@ ExecReScanAgg(AggState *node)
if (outerPlan->chgParam == NULL &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
- ResetTupleHashIterator(node->perhash[0].hashtable,
- &node->perhash[0].hashiter);
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) node->phases[0];
+ ResetTupleHashIterator(perhash->hashtable,
+ &perhash->hashiter);
+
+ /* reset to phase 0 */
+ initialize_phase(node, 0);
select_current_set(node, 0, true);
return;
}
@@ -3607,18 +3597,54 @@ ExecReScanAgg(AggState *node)
/*
* Reset the per-group state (in particular, mark transvalues null)
*/
- for (setno = 0; setno < numGroupingSets; setno++)
+ for (phaseidx = 0; phaseidx < node->numphases; phaseidx++)
{
- MemSet(node->pergroups[setno], 0,
- sizeof(AggStatePerGroupData) * node->numaggs);
+ AggStatePerPhase phase = node->phases[phaseidx];
+
+ /* hash pergroups is reset by build_hash_tables */
+ if (phase->is_hashed)
+ continue;
+
+ for (setno = 0; setno < phase->numsets; setno++)
+ MemSet(phase->pergroups[setno], 0,
+ sizeof(AggStatePerGroupData) * node->numaggs);
}
- /* Reset input_sorted */
+ /*
+ * the agg did its own first sort using tuplesort and the first
+ * tuplesort is kept (see initialize_phase), if the subplan does
+ * not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions,
+ * then we can just rescan the first tuplesort, no need to build
+ * it again.
+ *
+ * Note: the agg only does its own sort for grouping sets now.
+ */
if (aggnode->sortnode)
- node->input_sorted = false;
+ {
+ AggStatePerPhaseSort firstphase = (AggStatePerPhaseSort) node->phases[0];
+ bool randomAccess = (node->eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) != 0;
+ if (firstphase->sort_in &&
+ randomAccess &&
+ outerPlan->chgParam == NULL &&
+ !bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
+ {
+ tuplesort_rescan(firstphase->sort_in);
+ node->input_sorted = true;
+ }
+ else
+ {
+ if (firstphase->sort_in)
+ tuplesort_end(firstphase->sort_in);
+ firstphase->sort_in = NULL;
+ node->input_sorted = false;
+ }
+ }
- /* reset to phase 1 */
- initialize_phase(node, 1);
+ /* reset to phase 0 */
+ initialize_phase(node, 0);
node->input_done = false;
node->projected_set = -1;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index b855e73957..066cd59554 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2049,30 +2049,26 @@ llvm_compile_expr(ExprState *state)
case EEOP_AGG_PLAIN_PERGROUP_NULLCHECK:
{
int jumpnull;
- LLVMValueRef v_aggstatep;
- LLVMValueRef v_allpergroupsp;
+ LLVMValueRef v_pergroupsp;
LLVMValueRef v_pergroup_allaggs;
- LLVMValueRef v_setoff;
+ LLVMValueRef v_setno;
jumpnull = op->d.agg_plain_pergroup_nullcheck.jumpnull;
/*
- * pergroup_allaggs = aggstate->all_pergroups
- * [op->d.agg_plain_pergroup_nullcheck.setoff];
+ * pergroup =
+ * &op->d.agg_plain_pergroup_nullcheck.pergroups
+ * [op->d.agg_plain_pergroup_nullcheck.setno];
*/
- v_aggstatep = LLVMBuildBitCast(
- b, v_parent, l_ptr(StructAggState), "");
+ v_pergroupsp =
+ l_ptr_const(op->d.agg_plain_pergroup_nullcheck.pergroups,
+ l_ptr(l_ptr(StructAggStatePerGroupData)));
- v_allpergroupsp = l_load_struct_gep(
- b, v_aggstatep,
- FIELDNO_AGGSTATE_ALL_PERGROUPS,
- "aggstate.all_pergroups");
+ v_setno =
+ l_int32_const(op->d.agg_plain_pergroup_nullcheck.setno);
- v_setoff = l_int32_const(
- op->d.agg_plain_pergroup_nullcheck.setoff);
-
- v_pergroup_allaggs = l_load_gep1(
- b, v_allpergroupsp, v_setoff, "");
+ v_pergroup_allaggs =
+ l_load_gep1(b, v_pergroupsp, v_setno, "");
LLVMBuildCondBr(
b,
@@ -2094,6 +2090,7 @@ llvm_compile_expr(ExprState *state)
{
AggState *aggstate;
AggStatePerTrans pertrans;
+ AggStatePerGroup *pergroups;
FunctionCallInfo fcinfo;
LLVMValueRef v_aggstatep;
@@ -2103,12 +2100,12 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transvaluep;
LLVMValueRef v_transnullp;
- LLVMValueRef v_setoff;
+ LLVMValueRef v_setno;
LLVMValueRef v_transno;
LLVMValueRef v_aggcontext;
- LLVMValueRef v_allpergroupsp;
+ LLVMValueRef v_pergroupsp;
LLVMValueRef v_current_setp;
LLVMValueRef v_current_pertransp;
LLVMValueRef v_curaggcontext;
@@ -2124,6 +2121,7 @@ llvm_compile_expr(ExprState *state)
aggstate = castNode(AggState, state->parent);
pertrans = op->d.agg_trans.pertrans;
+ pergroups = op->d.agg_trans.pergroups;
fcinfo = pertrans->transfn_fcinfo;
@@ -2133,19 +2131,18 @@ llvm_compile_expr(ExprState *state)
l_ptr(StructAggStatePerTransData));
/*
- * pergroup = &aggstate->all_pergroups
- * [op->d.agg_strict_trans_check.setoff]
- * [op->d.agg_init_trans_check.transno];
+ * pergroup = &op->d.agg_trans.pergroups
+ * [op->d.agg_trans.setno]
+ * [op->d.agg_trans.transno];
*/
- v_allpergroupsp =
- l_load_struct_gep(b, v_aggstatep,
- FIELDNO_AGGSTATE_ALL_PERGROUPS,
- "aggstate.all_pergroups");
- v_setoff = l_int32_const(op->d.agg_trans.setoff);
+ v_pergroupsp =
+ l_ptr_const(pergroups,
+ l_ptr(l_ptr(StructAggStatePerGroupData)));
+ v_setno = l_int32_const(op->d.agg_trans.setno);
v_transno = l_int32_const(op->d.agg_trans.transno);
v_pergroupp =
LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
+ l_load_gep1(b, v_pergroupsp, v_setno, ""),
&v_transno, 1, "");
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 044ec92aa8..29f88bf0b7 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2226,8 +2226,6 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
chain = NIL;
if (list_length(rollups) > 1)
{
- bool is_first_sort = ((RollupData *) linitial(rollups))->is_hashed;
-
for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst(lc);
@@ -2244,24 +2242,17 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
*/
if (!rollup->is_hashed)
{
- if (!is_first_sort ||
- (is_first_sort && !best_path->is_sorted))
- {
- sort_plan = (Plan *)
- make_sort_from_groupcols(rollup->groupClause,
- new_grpColIdx,
- subplan);
-
- /*
- * Remove stuff we don't need to avoid bloating debug output.
- */
- sort_plan->targetlist = NIL;
- sort_plan->lefttree = NULL;
- }
- }
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ new_grpColIdx,
+ subplan);
- if (!rollup->is_hashed)
- is_first_sort = false;
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
if (rollup->is_hashed)
strat = AGG_HASHED;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index fc6a1d0044..68d9c88a53 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4344,7 +4344,8 @@ consider_groupingsets_paths(PlannerInfo *root,
if (unhashed_rollup)
{
- new_rollups = lappend(new_rollups, unhashed_rollup);
+ /* unhashed rollups always sit before hashed rollups */
+ new_rollups = lcons(unhashed_rollup, new_rollups);
strat = AGG_MIXED;
}
else if (empty_sets)
@@ -4357,7 +4358,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = list_length(empty_sets);
rollup->hashable = false;
rollup->is_hashed = false;
- new_rollups = lappend(new_rollups, rollup);
+ /* unhashed rollups always sit before hashed rollups */
+ new_rollups = lcons(rollup, new_rollups);
/* update is_sorted to true
* XXX why? shouldn't it be already set by the caller?
*/
@@ -4525,7 +4527,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = gs->numGroups;
rollup->hashable = true;
rollup->is_hashed = true;
- rollups = lcons(rollup, rollups);
+ /* non-hashed rollups always sit before hashed rollups */
+ rollups = lappend(rollups, rollup);
}
if (rollups)
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 0feb3363d3..2dfa3fa17e 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2983,7 +2983,7 @@ create_agg_path(PlannerInfo *root,
* 'rollups' is a list of RollupData nodes
* 'agg_costs' contains cost info about the aggregate functions to be computed
* 'numGroups' is the estimated total number of groups
- * 'is_sorted' is the input sorted in the group cols of first rollup
+ * 'is_sorted' is the input sorted in the group cols of first rollup
*/
GroupingSetsPath *
create_groupingsets_path(PlannerInfo *root,
@@ -3000,7 +3000,6 @@ create_groupingsets_path(PlannerInfo *root,
PathTarget *target = rel->reltarget;
ListCell *lc;
bool is_first = true;
- bool is_first_sort = true;
/* The topmost generated Plan node will be an Agg */
pathnode->path.pathtype = T_Agg;
@@ -3053,14 +3052,13 @@ create_groupingsets_path(PlannerInfo *root,
int numGroupCols = list_length(linitial(gsets));
/*
- * In AGG_SORTED or AGG_PLAIN mode, the first rollup takes the
- * (already-sorted) input, and following ones do their own sort.
+ * In AGG_SORTED or AGG_PLAIN mode, the first rollup does its own
+ * sort if is_sorted is false; the following ones do their own sorts.
*
* In AGG_HASHED mode, there is one rollup for each grouping set.
*
- * In AGG_MIXED mode, the first rollups are hashed, the first
- * non-hashed one takes the (already-sorted) input, and following ones
- * do their own sort.
+ * In AGG_MIXED mode, the first rollup does its own sort if is_sorted
+ * is false; the following non-hashed ones do their own sorts.
*/
if (is_first)
{
@@ -3092,33 +3090,23 @@ create_groupingsets_path(PlannerInfo *root,
input_startup_cost,
input_total_cost,
subpath->rows);
+
is_first = false;
- if (!rollup->is_hashed)
- is_first_sort = false;
}
else
{
- Path sort_path; /* dummy for result of cost_sort */
- Path agg_path; /* dummy for result of cost_agg */
-
- if (rollup->is_hashed || (is_first_sort && is_sorted))
- {
- /*
- * Account for cost of aggregation, but don't charge input
- * cost again
- */
- cost_agg(&agg_path, root,
- rollup->is_hashed ? AGG_HASHED : AGG_SORTED,
- agg_costs,
- numGroupCols,
- rollup->numGroups,
- having_qual,
- 0.0, 0.0,
- subpath->rows);
- if (!rollup->is_hashed)
- is_first_sort = false;
- }
- else
+ AggStrategy rollup_strategy;
+ Path sort_path; /* dummy for result of cost_sort */
+ Path agg_path; /* dummy for result of cost_agg */
+
+ sort_path.startup_cost = 0;
+ sort_path.total_cost = 0;
+ sort_path.rows = subpath->rows;
+
+ rollup_strategy = rollup->is_hashed ?
+ AGG_HASHED : (numGroupCols ? AGG_SORTED : AGG_PLAIN);
+
+ if (!rollup->is_hashed && numGroupCols)
{
/* Account for cost of sort, but don't charge input cost again */
cost_sort(&sort_path, root, NIL,
@@ -3128,20 +3116,19 @@ create_groupingsets_path(PlannerInfo *root,
0.0,
work_mem,
-1.0);
-
- /* Account for cost of aggregation */
-
- cost_agg(&agg_path, root,
- AGG_SORTED,
- agg_costs,
- numGroupCols,
- rollup->numGroups,
- having_qual,
- sort_path.startup_cost,
- sort_path.total_cost,
- sort_path.rows);
}
+ /* Account for cost of aggregation */
+ cost_agg(&agg_path, root,
+ rollup_strategy,
+ agg_costs,
+ numGroupCols,
+ rollup->numGroups,
+ having_qual,
+ sort_path.startup_cost,
+ sort_path.total_cost,
+ sort_path.rows);
+
pathnode->path.total_cost += agg_path.total_cost;
pathnode->path.rows += agg_path.rows;
}
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index dbe8649a57..4ed5d0a7de 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -626,7 +626,8 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_PLAIN_PERGROUP_NULLCHECK */
struct
{
- int setoff;
+ AggStatePerGroup *pergroups;
+ int setno;
int jumpnull;
} agg_plain_pergroup_nullcheck;
@@ -634,11 +635,11 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
struct
{
+ AggStatePerGroup *pergroups;
AggStatePerTrans pertrans;
ExprContext *aggcontext;
int setno;
int transno;
- int setoff;
} agg_trans;
} d;
} ExprEvalStep;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 94890512dc..1f37f9236b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash, bool nullcheck);
+ bool nullcheck);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 66a83b9ac9..c5d4121c37 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -270,16 +270,29 @@ typedef struct AggStatePerGroupData
*/
typedef struct AggStatePerPhaseData
{
+ bool is_hashed; /* plan to do hash aggregate */
AggStrategy aggstrategy; /* strategy for this phase */
- int numsets; /* number of grouping sets (or 0) */
+ int numsets; /* number of grouping sets */
int *gset_lengths; /* lengths of grouping sets */
Bitmapset **grouped_cols; /* column groupings for rollup */
- ExprState **eqfunctions; /* expression returning equality, indexed by
- * nr of cols to compare */
Agg *aggnode; /* Agg node for phase data */
ExprState *evaltrans; /* evaluation of transition functions */
+
+ List *concurrent_hashes; /* hash phases can do transition concurrently */
+ AggStatePerGroup *pergroups; /* pergroup states for a phase */
+ bool skip_build_trans;
} AggStatePerPhaseData;
+typedef struct AggStatePerPhaseSortData
+{
+ AggStatePerPhaseData phasedata;
+ Tuplesortstate *sort_in; /* sorted input to phases > 1 */
+ Tuplestorestate *store_in; /* sorted input to phases > 1 */
+ ExprState **eqfunctions; /* expression returning equality, indexed by
+ * nr of cols to compare */
+ bool copy_out; /* hint to copy input tuples for the next phase */
+} AggStatePerPhaseSortData;
+
/*
* AggStatePerHashData - per-hashtable state
*
@@ -287,8 +300,9 @@ typedef struct AggStatePerPhaseData
* grouping set. (When doing hashing without grouping sets, we have just one of
* them.)
*/
-typedef struct AggStatePerHashData
+typedef struct AggStatePerPhaseHashData
{
+ AggStatePerPhaseData phasedata;
TupleHashTable hashtable; /* hash table with one entry per group */
TupleHashIterator hashiter; /* for iterating through hash table */
TupleTableSlot *hashslot; /* slot for loading hash table */
@@ -299,9 +313,7 @@ typedef struct AggStatePerHashData
int largestGrpColIdx; /* largest col required for hashing */
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
- Agg *aggnode; /* original Agg node, for numGroups etc. */
-} AggStatePerHashData;
-
+} AggStatePerPhaseHashData;
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 5e33a368f5..4081a0978e 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2036,7 +2036,8 @@ typedef struct AggStatePerAggData *AggStatePerAgg;
typedef struct AggStatePerTransData *AggStatePerTrans;
typedef struct AggStatePerGroupData *AggStatePerGroup;
typedef struct AggStatePerPhaseData *AggStatePerPhase;
-typedef struct AggStatePerHashData *AggStatePerHash;
+typedef struct AggStatePerPhaseSortData *AggStatePerPhaseSort;
+typedef struct AggStatePerPhaseHashData *AggStatePerPhaseHash;
typedef struct AggState
{
@@ -2068,28 +2069,19 @@ typedef struct AggState
List *all_grouped_cols; /* list of all grouped cols in DESC order */
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
- AggStatePerPhase phases; /* array of all phases */
+ AggStatePerPhase *phases; /* array of all phases */
Tuplesortstate *sort_in; /* sorted input to phases > 1 */
Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
- AggStatePerGroup *pergroups; /* grouping set indexed array of per-group
- * pointers */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
- /* these fields are used in AGG_HASHED and AGG_MIXED modes: */
+ /* these fields are used in AGG_HASHED */
bool table_filled; /* hash table filled yet? */
- int num_hashes;
- AggStatePerHash perhash; /* array of per-hashtable data */
- AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
- * per-group pointers */
/* these fields are used in AGG_SORTED and AGG_MIXED */
bool input_sorted; /* hash table filled yet? */
+ int eflags; /* eflags for the first sort */
- /* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 35
- AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
- * ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
} AggState;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 12425f46ca..e7689ebd16 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1004,10 +1004,10 @@ explain (costs off) select a, b, grouping(a,b), sum(v), count(*), max(v)
Sort
Sort Key: (GROUPING("*VALUES*".column1, "*VALUES*".column2)), "*VALUES*".column1, "*VALUES*".column2
-> MixedAggregate
+ Group Key: ()
Hash Key: "*VALUES*".column1, "*VALUES*".column2
Hash Key: "*VALUES*".column1
Hash Key: "*VALUES*".column2
- Group Key: ()
-> Values Scan on "*VALUES*"
(8 rows)
@@ -1066,9 +1066,9 @@ explain (costs off)
Sort
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
- Hash Key: unsortable_col
Sort Key: unhashable_col
Group Key: unhashable_col
+ Hash Key: unsortable_col
-> Seq Scan on gstest4
(7 rows)
@@ -1108,9 +1108,9 @@ explain (costs off)
Sort
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
- Hash Key: v, unsortable_col
Sort Key: v, unhashable_col
Group Key: v, unhashable_col
+ Hash Key: v, unsortable_col
-> Seq Scan on gstest4
(7 rows)
@@ -1149,10 +1149,10 @@ explain (costs off)
QUERY PLAN
--------------------------------
MixedAggregate
- Hash Key: a, b
Group Key: ()
Group Key: ()
Group Key: ()
+ Hash Key: a, b
-> Seq Scan on gstest_empty
(6 rows)
@@ -1310,10 +1310,10 @@ explain (costs off)
-> Sort
Sort Key: a, b
-> MixedAggregate
+ Group Key: ()
Hash Key: a, b
Hash Key: a
Hash Key: b
- Group Key: ()
-> Seq Scan on gstest2
(11 rows)
@@ -1345,10 +1345,10 @@ explain (costs off)
Sort
Sort Key: gstest_data.a, gstest_data.b
-> MixedAggregate
+ Group Key: ()
Hash Key: gstest_data.a, gstest_data.b
Hash Key: gstest_data.a
Hash Key: gstest_data.b
- Group Key: ()
-> Nested Loop
-> Values Scan on "*VALUES*"
-> Function Scan on gstest_data
@@ -1545,16 +1545,16 @@ explain (costs off)
QUERY PLAN
----------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
Sort Key: unique1
Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
Sort Key: thousand
Group Key: thousand
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(12 rows)
@@ -1567,12 +1567,12 @@ explain (costs off)
QUERY PLAN
-------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
Sort Key: unique1
Group Key: unique1
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(8 rows)
@@ -1586,15 +1586,15 @@ explain (costs off)
QUERY PLAN
----------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
- Hash Key: thousand
Sort Key: unique1
Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
+ Hash Key: thousand
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(11 rows)
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..7818f02032 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -340,8 +340,8 @@ SELECT c, sum(a) FROM pagg_tab GROUP BY rollup(c) ORDER BY 1, 2;
Sort
Sort Key: pagg_tab.c, (sum(pagg_tab.a))
-> MixedAggregate
- Hash Key: pagg_tab.c
Group Key: ()
+ Hash Key: pagg_tab.c
-> Append
-> Seq Scan on pagg_tab_p1 pagg_tab_1
-> Seq Scan on pagg_tab_p2 pagg_tab_2
--
2.21.1
Attachment: 0005-fix.patch (text/plain; charset=us-ascii)
From df0f50a9bdf32def4a8768ad03e75e6a6b42b249 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 02:54:39 +0100
Subject: [PATCH 5/7] fix
---
.../postgres_fdw/expected/postgres_fdw.out | 4 ++--
src/backend/executor/execExpr.c | 5 +++--
src/backend/executor/nodeAgg.c | 20 +++++++++----------
src/backend/optimizer/util/pathnode.c | 4 ++--
4 files changed, 17 insertions(+), 16 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 62c2697920..fc0ed2f4d5 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3448,8 +3448,8 @@ select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls la
Sort Key: ft1.c2
-> MixedAggregate
Output: c2, sum(c1)
- Hash Key: ft1.c2
Group Key: ()
+ Hash Key: ft1.c2
-> Foreign Scan on public.ft1
Output: c2, c1
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
@@ -3473,8 +3473,8 @@ select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last
Sort Key: ft1.c2
-> MixedAggregate
Output: c2, sum(c1)
- Hash Key: ft1.c2
Group Key: ()
+ Hash Key: ft1.c2
-> Foreign Scan on public.ft1
Output: c2, c1
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 07789501f7..669843faf5 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -2937,7 +2937,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase, bool nullcheck)
PlanState *parent = &aggstate->ss.ps;
ExprEvalStep scratch = {0};
bool isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
- ListCell *lc;
+ ListCell *lc;
LastAttnumInfo deform = {0, 0, 0};
state->expr = (Expr *) aggstate;
@@ -2978,6 +2978,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase, bool nullcheck)
NullableDatum *strictargs = NULL;
bool *strictnulls = NULL;
int argno;
+ int setno;
ListCell *bail;
/*
@@ -3155,7 +3156,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase, bool nullcheck)
* grouping set). Do so for both sort and hash based computations, as
* applicable.
*/
- for (int setno = 0; setno < phase->numsets; setno++)
+ for (setno = 0; setno < phase->numsets; setno++)
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
pertrans, transno, setno, phase, nullcheck);
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 20c5eb98b3..38d0bd5895 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -333,7 +333,7 @@ initialize_phase(AggState *aggstate, int newphase)
AggStatePerPhaseSort persort;
Assert(newphase == 0 || newphase == aggstate->current_phase + 1);
-
+
/* Don't use aggstate->phase here, it might not be initialized yet*/
current_phase = aggstate->phases[aggstate->current_phase];
@@ -1516,7 +1516,7 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
* When called, CurrentMemoryContext should be the per-query context. The
* already-calculated hash value for the tuple must be specified.
*/
-static void
+static void
lookup_hash_entry(AggState *aggstate, AggStatePerPhaseHash perhash, uint32 hash)
{
TupleTableSlot *hashslot = perhash->hashslot;
@@ -1724,7 +1724,7 @@ agg_retrieve_direct(AggState *aggstate)
numGroupingSets = aggstate->phase->numsets;
node = aggstate->phase->aggnode;
numReset = numGroupingSets;
- pergroups = aggstate->phase->pergroups;
+ pergroups = aggstate->phase->pergroups;
}
else
{
@@ -2123,7 +2123,7 @@ agg_retrieve_hash_table(AggState *aggstate)
*/
select_current_set(aggstate, 0, true);
initialize_phase(aggstate, aggstate->current_phase + 1);
- perhash = (AggStatePerPhaseHash) aggstate->phase;
+ perhash = (AggStatePerPhaseHash) aggstate->phase;
ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
continue;
@@ -2269,7 +2269,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->maxsets = numGroupingSets;
aggstate->numphases = 1 + list_length(node->chain);
-
+
/*
* The first phase is not sorted, agg need to do its own sort. See
* agg_sort_input(), this can only happen in groupingsets case.
@@ -2390,7 +2390,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numaggs = aggstate->numaggs;
Assert(numaggs == list_length(aggstate->aggs));
- /*
+ /*
* For each phase, prepare grouping set data and fmgr lookup data for
* compare functions. Accumulate all_grouped_cols in passing.
*/
@@ -2430,7 +2430,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
all_grouped_cols = bms_add_members(all_grouped_cols, cols);
- /*
+ /*
* Initialize pergroup state. For AGG_HASHED, all groups do transition
* on the fly, all pergroup states are kept in hashtable, everytime
* a tuple is processed, lookup_hash_entry() choose one group and
@@ -2499,7 +2499,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->grouped_cols = NULL;
}
- /*
+ /*
* Initialize pergroup states for AGG_SORTED/AGG_PLAIN/AGG_MIXED
* phases, each set only have one group on the fly, all groups in
* a set can reuse a pergroup state. Unlike AGG_HASHED, we
@@ -3610,8 +3610,8 @@ ExecReScanAgg(AggState *node)
sizeof(AggStatePerGroupData) * node->numaggs);
}
- /*
- * the agg did its own first sort using tuplesort and the first
+ /*
+ * The agg did its own first sort using tuplesort and the first
* tuplesort is kept (see initialize_phase), if the subplan does
* not have any parameter changes, and none of our own parameter
* changes affect input expressions of the aggregated functions,
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 2dfa3fa17e..ff8f676dfb 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2983,7 +2983,7 @@ create_agg_path(PlannerInfo *root,
* 'rollups' is a list of RollupData nodes
* 'agg_costs' contains cost info about the aggregate functions to be computed
* 'numGroups' is the estimated total number of groups
- * 'is_sorted' is the input sorted in the group cols of first rollup
+ * 'is_sorted' is the input sorted in the group cols of first rollup
*/
GroupingSetsPath *
create_groupingsets_path(PlannerInfo *root,
@@ -3098,7 +3098,7 @@ create_groupingsets_path(PlannerInfo *root,
AggStrategy rollup_strategy;
Path sort_path; /* dummy for result of cost_sort */
Path agg_path; /* dummy for result of cost_agg */
-
+
sort_path.startup_cost = 0;
sort_path.total_cost = 0;
sort_path.rows = subpath->rows;
--
2.21.1
Attachment: 0006-Parallel-grouping-sets.patch (text/plain; charset=iso-8859-1)
From 243965f12c16c8ded348255254a179ec198f6812 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 03:00:50 +0100
Subject: [PATCH 6/7] Parallel grouping sets
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Grouping sets used to be computed in a single worker only; this patch
adds support for parallel grouping sets using multiple workers.
The main idea of parallel grouping sets is that, like parallel
aggregation, we separate grouping sets into two stages:
The initial stage: this stage has almost the same plan and execution
routines as the current implementation of grouping sets. The differences
are that 1) it only produces partial aggregate results, and 2) an extra
grouping set id is attached to the output. Partial aggregate results
will be combined in the final stage, and since there are multiple
grouping sets, only partial results belonging to the same grouping set
can be combined; the grouping set id is introduced to identify the sets.
We keep all the optimizations of multiple grouping sets in the initial
stage, e.g. 1) grouping sets that can be grouped by a single sort are
put into one rollup structure, so those sets are computed in one
aggregate phase; 2) hash aggregation is done concurrently while a sort
aggregate is performed; 3) all hash transitions are done in one
expression state.
The final stage: this stage combines the partial aggregate results
according to the grouping set id. Obviously, none of the optimizations
of the initial stage can be used, so all rollups are extracted and each
rollup contains only one grouping set; each aggregate phase then
processes only one set. We add a filter in the final stage to redirect
tuples to the matching aggregate phase.
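The two-stage scheme above can be sketched in miniature. Below is a toy
Python model (not part of the patch; the names initial_stage and
final_stage are invented for illustration) using sum() as the only
aggregate and two grouping sets, to show why the grouping set id is
needed when the leader combines partials:

```python
# Toy model of two-stage parallel grouping sets: each "worker" scans a
# slice of the rows, the initial stage produces partial aggregates
# tagged with a grouping-set id, and the final stage combines only
# partials that share the same (set id, group key).
from collections import defaultdict

rows = [("a", 1, 10), ("a", 2, 20), ("b", 1, 30),
        ("b", 2, 40), ("a", 1, 50), ("b", 1, 60)]

# GROUPING SETS ((c1), (c2)): set id 0 groups by c1, set id 1 by c2
grouping_sets = [lambda r: (r[0],), lambda r: (r[1],)]

def initial_stage(chunk):
    """One worker: partial sum(c3) per (set id, group key)."""
    partial = defaultdict(int)
    for r in chunk:
        for setid, keyfn in enumerate(grouping_sets):
            partial[(setid, keyfn(r))] += r[2]
    return partial

def final_stage(partials):
    """Leader: combine partials keyed by (set id, group key)."""
    final = defaultdict(int)
    for p in partials:
        for (setid, key), s in p.items():
            final[(setid, key)] += s
    return dict(final)

# Two "workers", each scanning half of the input
partials = [initial_stage(rows[:3]), initial_stage(rows[3:])]
result = final_stage(partials)
```

In the actual patch the final stage is a grouping-sets Agg node that
filters tuples by the grouping set id expression; the dictionary keyed
by (setid, key) above plays the same role.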
---
src/backend/commands/explain.c | 10 +-
src/backend/executor/execExpr.c | 10 +-
src/backend/executor/execExprInterp.c | 11 +
src/backend/executor/nodeAgg.c | 261 +++++++++++++++++-
src/backend/jit/llvm/llvmjit_expr.c | 40 +++
src/backend/nodes/copyfuncs.c | 56 +++-
src/backend/nodes/equalfuncs.c | 3 +
src/backend/nodes/nodeFuncs.c | 8 +
src/backend/nodes/outfuncs.c | 14 +-
src/backend/nodes/readfuncs.c | 53 +++-
src/backend/optimizer/path/allpaths.c | 5 +-
src/backend/optimizer/plan/createplan.c | 26 +-
src/backend/optimizer/plan/planner.c | 343 ++++++++++++++++++------
src/backend/optimizer/plan/setrefs.c | 16 ++
src/backend/optimizer/util/pathnode.c | 27 +-
src/backend/utils/adt/ruleutils.c | 6 +
src/include/executor/execExpr.h | 1 +
src/include/executor/nodeAgg.h | 2 +
src/include/nodes/execnodes.h | 8 +-
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 2 +
src/include/nodes/plannodes.h | 4 +-
src/include/nodes/primnodes.h | 6 +
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/planmain.h | 2 +-
25 files changed, 791 insertions(+), 125 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 2c63cdb46c..8b6877c41e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2256,12 +2256,16 @@ show_agg_keys(AggState *astate, List *ancestors,
{
Agg *plan = (Agg *) astate->ss.ps.plan;
- if (plan->numCols > 0 || plan->groupingSets)
+ if (plan->gsetid)
+ show_expression((Node *) plan->gsetid, "Filtered by",
+ (PlanState *) astate, ancestors, true, es);
+
+ if (plan->numCols > 0 || plan->rollup)
{
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(plan, ancestors);
- if (plan->groupingSets)
+ if (plan->rollup)
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
@@ -2312,7 +2316,7 @@ show_grouping_set_keys(PlanState *planstate,
Plan *plan = planstate->plan;
char *exprstr;
ListCell *lc;
- List *gsets = aggnode->groupingSets;
+ List *gsets = aggnode->rollup->gsets;
AttrNumber *keycols = aggnode->grpColIdx;
const char *keyname;
const char *keysetname;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 669843faf5..bf69fcfe97 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -815,7 +815,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
agg = (Agg *) (state->parent->plan);
- if (agg->groupingSets)
+ if (agg->rollup)
scratch.d.grouping_func.clauses = grp_node->cols;
else
scratch.d.grouping_func.clauses = NIL;
@@ -824,6 +824,14 @@ ExecInitExprRec(Expr *node, ExprState *state,
break;
}
+ case T_GroupingSetId:
+ {
+ scratch.opcode = EEOP_GROUPING_SET_ID;
+
+ ExprEvalPushStep(state, &scratch);
+ break;
+ }
+
case T_WindowFunc:
{
WindowFunc *wfunc = (WindowFunc *) node;
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index b0dbba4e55..b3537eb8d9 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -428,6 +428,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_XMLEXPR,
&&CASE_EEOP_AGGREF,
&&CASE_EEOP_GROUPING_FUNC,
+ &&CASE_EEOP_GROUPING_SET_ID,
&&CASE_EEOP_WINDOW_FUNC,
&&CASE_EEOP_SUBPLAN,
&&CASE_EEOP_ALTERNATIVE_SUBPLAN,
@@ -1512,6 +1513,16 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_GROUPING_SET_ID)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+
+ *op->resvalue = aggstate->phase->setno_gsetids[aggstate->current_set];
+ *op->resnull = false;
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_WINDOW_FUNC)
{
/*
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 38d0bd5895..f7b98dd798 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -282,6 +282,7 @@ static void lookup_hash_entry(AggState *aggstate, AggStatePerPhaseHash perhash,
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static void agg_sort_input(AggState *aggstate);
+static void agg_preprocess_groupingsets(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static Datum GetAggInitVal(Datum textInitVal, Oid transtype);
static void build_pertrans_for_aggref(AggStatePerTrans pertrans,
@@ -341,17 +342,26 @@ initialize_phase(AggState *aggstate, int newphase)
* Whatever the previous state, we're now done with whatever input
* tuplesort was in use, cleanup them.
*
- * Note: we keep the first tuplesort/tuplestore, this will benifit the
+ * Note: we keep the first tuplesort/tuplestore when it's not the
+ * final stage of partial groupingsets, this will benefit the
* rescan in some cases without resorting the input again.
*/
- if (!current_phase->is_hashed && aggstate->current_phase > 0)
+ if (!current_phase->is_hashed &&
+ (aggstate->current_phase > 0 || DO_AGGSPLIT_COMBINE(aggstate->aggsplit)))
{
persort = (AggStatePerPhaseSort) current_phase;
+
if (persort->sort_in)
{
tuplesort_end(persort->sort_in);
persort->sort_in = NULL;
}
+
+ if (persort->store_in)
+ {
+ tuplestore_end(persort->store_in);
+ persort->store_in = NULL;
+ }
}
/* advance to next phase */
@@ -420,6 +430,15 @@ fetch_input_tuple(AggState *aggstate)
return NULL;
slot = aggstate->sort_slot;
}
+ else if (current_phase->store_in)
+ {
+ /* make sure we check for interrupts in either path through here */
+ CHECK_FOR_INTERRUPTS();
+ if (!tuplestore_gettupleslot(current_phase->store_in, true, false,
+ aggstate->sort_slot))
+ return NULL;
+ slot = aggstate->sort_slot;
+ }
else
slot = ExecProcNode(outerPlanState(aggstate));
@@ -1597,6 +1616,9 @@ ExecAgg(PlanState *pstate)
CHECK_FOR_INTERRUPTS();
+ if (node->groupingsets_preprocess)
+ agg_preprocess_groupingsets(node);
+
if (!node->agg_done)
{
/* Dispatch based on strategy */
@@ -1637,7 +1659,7 @@ agg_retrieve_direct(AggState *aggstate)
TupleTableSlot *outerslot;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- bool hasGroupingSets = aggstate->phase->aggnode->groupingSets != NULL;
+ bool hasGroupingSets = aggstate->phase->aggnode->rollup != NULL;
int numGroupingSets = aggstate->phase->numsets;
int currentSet;
int nextSetSize;
@@ -1970,6 +1992,135 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+/*
+ * Routine for final phase of partial grouping sets:
+ *
+ * Preprocess tuples for the final phase of grouping sets. In the
+ * initial phase, each tuple is decorated with a grouping set ID; in
+ * the final phase, each grouping set is handled by a different
+ * aggregate phase, so we need to redirect each tuple to its aggregate
+ * phase according to the grouping set ID.
+ */
+static void
+agg_preprocess_groupingsets(AggState *aggstate)
+{
+ AggStatePerPhaseSort persort;
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase;
+ TupleTableSlot *outerslot;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ int phaseidx;
+
+ Assert(DO_AGGSPLIT_COMBINE(aggstate->aggsplit));
+ Assert(aggstate->groupingsets_preprocess);
+
+ /* Initialize tuple storage for each aggregate phase */
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
+ {
+ phase = aggstate->phases[phaseidx];
+
+ if (!phase->is_hashed)
+ {
+ persort = (AggStatePerPhaseSort) phase;
+ if (phase->aggnode->sortnode)
+ {
+ Sort *sortnode = (Sort *) phase->aggnode->sortnode;
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+
+ persort->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ }
+ else
+ {
+ persort->store_in = tuplestore_begin_heap(false, false, work_mem);
+ }
+ }
+ else
+ {
+ /*
+ * If it's AGG_HASHED, we don't need storage to keep the
+ * tuples for later processing; we can do the transition
+ * immediately.
+ */
+ }
+ }
+
+ for (;;)
+ {
+ Datum ret;
+ bool isNull;
+ int setid;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
+ if (TupIsNull(outerslot))
+ break;
+
+ tmpcontext->ecxt_outertuple = outerslot;
+
+ /* Figure out which grouping set the tuple belongs to */
+ ret = ExecEvalExprSwitchContext(aggstate->gsetid, tmpcontext, &isNull);
+
+ setid = DatumGetInt32(ret);
+ phase = aggstate->phases[aggstate->gsetid_phaseidxs[setid]];
+
+ if (!phase->is_hashed)
+ {
+ persort = (AggStatePerPhaseSort) phase;
+
+ Assert(persort->sort_in || persort->store_in);
+
+ if (persort->sort_in)
+ tuplesort_puttupleslot(persort->sort_in, outerslot);
+ else if (persort->store_in)
+ tuplestore_puttupleslot(persort->store_in, outerslot);
+ }
+ else
+ {
+ int hash;
+ bool dummynull;
+
+ perhash = (AggStatePerPhaseHash) phase;
+
+ /* If it is hashed, we can do the transition now. */
+ select_current_set(aggstate, 0, true);
+ prepare_hash_slot(aggstate, perhash);
+ hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
+ lookup_hash_entry(aggstate, perhash, hash);
+
+ ExecEvalExprSwitchContext(phase->evaltrans,
+ tmpcontext,
+ &dummynull);
+ }
+
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ /* Sort the first phase if needed */
+ if (aggstate->aggstrategy != AGG_HASHED)
+ {
+ persort = (AggStatePerPhaseSort) aggstate->phase;
+
+ if (persort->sort_in)
+ tuplesort_performsort(persort->sort_in);
+ }
+
+ /* Mark the hash table to be filled */
+ aggstate->table_filled = true;
+
+ /* Mark the input table to be sorted */
+ aggstate->input_sorted = true;
+
+ /* Reset the flag so we don't preprocess grouping sets again */
+ aggstate->groupingsets_preprocess = false;
+}
+
static void
agg_sort_input(AggState *aggstate)
{
@@ -2246,21 +2397,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->grp_firstTuple = NULL;
aggstate->input_sorted = true;
aggstate->eflags = eflags;
+ aggstate->groupingsets_preprocess = false;
/*
* Calculate the maximum number of grouping sets in any phase; this
* determines the size of some allocations.
*/
- if (node->groupingSets)
+ if (node->rollup)
{
- numGroupingSets = list_length(node->groupingSets);
+ numGroupingSets = list_length(node->rollup->gsets);
foreach(l, node->chain)
{
Agg *agg = lfirst(l);
numGroupingSets = Max(numGroupingSets,
- list_length(agg->groupingSets));
+ list_length(agg->rollup->gsets));
if (agg->aggstrategy != AGG_HASHED)
need_extra_slot = true;
@@ -2270,6 +2422,28 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->maxsets = numGroupingSets;
aggstate->numphases = 1 + list_length(node->chain);
+ /*
+ * We are doing the final stage of partial grouping sets: preprocess
+ * the input tuples first, redirecting them to the corresponding
+ * aggregate phases. See agg_preprocess_groupingsets().
+ */
+ if (node->rollup && DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ aggstate->groupingsets_preprocess = true;
+
+ /*
+ * Allocate the gsetid <-> phase mapping. In the final stage of
+ * partial grouping sets, all grouping sets are extracted into
+ * individual phases, so the number of sets equals the number
+ * of phases.
+ */
+ aggstate->gsetid_phaseidxs =
+ (int *) palloc0(aggstate->numphases * sizeof(int));
+
+ if (aggstate->aggstrategy != AGG_HASHED)
+ need_extra_slot = true;
+ }
+
/*
* The first phase is not sorted, agg need to do its own sort. See
* agg_sort_input(), this can only happen in groupingsets case.
@@ -2384,6 +2558,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.qual =
ExecInitQual(node->plan.qual, (PlanState *) aggstate);
+ /*
+ * Initialize expression state to fetch grouping set id from
+ * the partial groupingsets aggregate result.
+ */
+ aggstate->gsetid =
+ ExecInitExpr(node->gsetid, (PlanState *)aggstate);
/*
* We should now have found all Aggrefs in the targetlist and quals.
*/
@@ -2431,6 +2611,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
all_grouped_cols = bms_add_members(all_grouped_cols, cols);
/*
+ * In the initial stage of partial grouping sets, an extra
+ * grouping set ID is provided in the targetlist. Fill the
+ * setno <-> gsetid map so EEOP_GROUPING_SET_ID can evaluate
+ * the correct gsetid for the output.
+ */
+ if (aggnode->rollup &&
+ DO_AGGSPLIT_SERIALIZE(aggnode->aggsplit))
+ {
+ GroupingSetData *gs;
+ phasedata->setno_gsetids = palloc(sizeof(int));
+ gs = linitial_node(GroupingSetData,
+ aggnode->rollup->gsets_data);
+ phasedata->setno_gsetids[0] = gs->setId;
+ }
+
+ /*
* Initialize pergroup state. For AGG_HASHED, all groups do transition
* on the fly, all pergroup states are kept in hashtable, everytime
* a tuple is processed, lookup_hash_entry() choose one group and
@@ -2448,8 +2644,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* we can do the transition immediately when a tuple is fetched,
* which means we can do the transition concurrently with the
* first phase.
+ *
+ * Note: this does not work for the final phase of partial grouping
+ * sets, in which each partial input tuple has a specific target
+ * aggregate phase.
*/
- if (phaseidx > 0)
+ if (phaseidx > 0 && !aggstate->groupingsets_preprocess)
{
aggstate->phases[0]->concurrent_hashes =
lappend(aggstate->phases[0]->concurrent_hashes, perhash);
@@ -2467,17 +2667,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->aggnode = aggnode;
phasedata->aggstrategy = aggnode->aggstrategy;
- if (aggnode->groupingSets)
+ if (aggnode->rollup)
{
- phasedata->numsets = list_length(aggnode->groupingSets);
+ phasedata->numsets = list_length(aggnode->rollup->gsets_data);
phasedata->gset_lengths = palloc(phasedata->numsets * sizeof(int));
phasedata->grouped_cols = palloc(phasedata->numsets * sizeof(Bitmapset *));
+ phasedata->setno_gsetids = palloc(phasedata->numsets * sizeof(int));
i = 0;
- foreach(l, aggnode->groupingSets)
+ foreach(l, aggnode->rollup->gsets_data)
{
- int current_length = list_length(lfirst(l));
- Bitmapset *cols = NULL;
+ GroupingSetData *gs = lfirst_node(GroupingSetData, l);
+ int current_length = list_length(gs->set);
+ Bitmapset *cols = NULL;
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -2486,6 +2688,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->grouped_cols[i] = cols;
phasedata->gset_lengths[i] = current_length;
+ /*
+ * In the initial stage of partial grouping sets, an extra
+ * grouping set ID is provided in the targetlist. Fill the
+ * setno <-> gsetid map so EEOP_GROUPING_SET_ID can evaluate
+ * the correct gsetid for the output.
+ */
+ if (DO_AGGSPLIT_SERIALIZE(aggstate->aggsplit))
+ phasedata->setno_gsetids[i] = gs->setId;
+
++i;
}
@@ -2562,8 +2773,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* For non-first AGG_SORTED phase, it processes the same input
* tuples with previous phase except that it need to resort the
* input tuples. Tell the previous phase to copy out the tuples.
+ *
+ * Note: this doesn't work for the final stage of partial grouping
+ * sets, in which each tuple has a specific target aggregate phase.
*/
- if (phaseidx > 0)
+ if (phaseidx > 0 && !aggstate->groupingsets_preprocess)
{
AggStatePerPhaseSort prev =
(AggStatePerPhaseSort) aggstate->phases[phaseidx - 1];
@@ -2574,6 +2788,18 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
}
+ /*
+ * Fill the gsetid_phaseidxs array so we can find the corresponding
+ * phase for a given gsetid.
+ */
+ if (aggstate->groupingsets_preprocess)
+ {
+ GroupingSetData *gs =
+ linitial_node(GroupingSetData, aggnode->rollup->gsets_data);
+
+ aggstate->gsetid_phaseidxs[gs->setId] = phaseidx;
+ }
+
aggstate->phases[phaseidx] = phasedata;
}
@@ -3461,6 +3687,8 @@ ExecEndAgg(AggState *node)
persort = (AggStatePerPhaseSort) phase;
if (persort->sort_in)
tuplesort_end(persort->sort_in);
+ if (persort->store_in)
+ tuplestore_end(persort->store_in);
}
for (transno = 0; transno < node->numtrans; transno++)
@@ -3643,6 +3871,13 @@ ExecReScanAgg(AggState *node)
}
}
+ /*
+ * If the agg is doing the final stage of partial grouping sets,
+ * reset the flag so grouping sets are preprocessed again.
+ */
+ if (aggnode->rollup && DO_AGGSPLIT_COMBINE(node->aggsplit))
+ node->groupingsets_preprocess = true;
+
/* reset to phase 0 */
initialize_phase(node, 0);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 066cd59554..f442442269 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -1882,6 +1882,46 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_GROUPING_SET_ID:
+ {
+ LLVMValueRef v_resvalue;
+ LLVMValueRef v_aggstatep;
+ LLVMValueRef v_phase;
+ LLVMValueRef v_current_set;
+ LLVMValueRef v_setno_gsetids;
+
+ v_aggstatep =
+ LLVMBuildBitCast(b, v_parent, l_ptr(StructAggState), "");
+
+ /*
+ * op->resvalue =
+ * aggstate->phase->setno_gsetids
+ * [aggstate->current_set]
+ */
+ v_phase =
+ l_load_struct_gep(b, v_aggstatep,
+ FIELDNO_AGGSTATE_PHASE,
+ "aggstate.phase");
+ v_setno_gsetids =
+ l_load_struct_gep(b, v_phase,
+ FIELDNO_AGGSTATEPERPHASE_SETNOGSETIDS,
+ "aggstateperphase.setno_gsetids");
+ v_current_set =
+ l_load_struct_gep(b, v_aggstatep,
+ FIELDNO_AGGSTATE_CURRENT_SET,
+ "aggstate.current_set");
+ v_resvalue =
+ l_load_gep1(b, v_setno_gsetids, v_current_set, "");
+ v_resvalue =
+ LLVMBuildZExt(b, v_resvalue, TypeSizeT, "");
+
+ LLVMBuildStore(b, v_resvalue, v_resvaluep);
+ LLVMBuildStore(b, l_sbool_const(0), v_resnullp);
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
case EEOP_WINDOW_FUNC:
{
WindowFuncExprState *wfunc = op->d.window_func.wfstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 04b4c65858..de4dcfe165 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -990,8 +990,9 @@ _copyAgg(const Agg *from)
COPY_SCALAR_FIELD(numGroups);
COPY_SCALAR_FIELD(transitionSpace);
COPY_BITMAPSET_FIELD(aggParams);
- COPY_NODE_FIELD(groupingSets);
+ COPY_NODE_FIELD(rollup);
COPY_NODE_FIELD(chain);
+ COPY_NODE_FIELD(gsetid);
COPY_NODE_FIELD(sortnode);
return newnode;
@@ -1478,6 +1479,50 @@ _copyGroupingFunc(const GroupingFunc *from)
return newnode;
}
+/*
+ * _copyGroupingSetId
+ */
+static GroupingSetId *
+_copyGroupingSetId(const GroupingSetId *from)
+{
+ GroupingSetId *newnode = makeNode(GroupingSetId);
+
+ return newnode;
+}
+
+/*
+ * _copyRollupData
+ */
+static RollupData*
+_copyRollupData(const RollupData *from)
+{
+ RollupData *newnode = makeNode(RollupData);
+
+ COPY_NODE_FIELD(groupClause);
+ COPY_NODE_FIELD(gsets);
+ COPY_NODE_FIELD(gsets_data);
+ COPY_SCALAR_FIELD(numGroups);
+ COPY_SCALAR_FIELD(hashable);
+ COPY_SCALAR_FIELD(is_hashed);
+
+ return newnode;
+}
+
+/*
+ * _copyGroupingSetData
+ */
+static GroupingSetData *
+_copyGroupingSetData(const GroupingSetData *from)
+{
+ GroupingSetData *newnode = makeNode(GroupingSetData);
+
+ COPY_NODE_FIELD(set);
+ COPY_SCALAR_FIELD(setId);
+ COPY_SCALAR_FIELD(numGroups);
+
+ return newnode;
+}
+
/*
* _copyWindowFunc
*/
@@ -4972,6 +5017,9 @@ copyObjectImpl(const void *from)
case T_GroupingFunc:
retval = _copyGroupingFunc(from);
break;
+ case T_GroupingSetId:
+ retval = _copyGroupingSetId(from);
+ break;
case T_WindowFunc:
retval = _copyWindowFunc(from);
break;
@@ -5608,6 +5656,12 @@ copyObjectImpl(const void *from)
case T_SortGroupClause:
retval = _copySortGroupClause(from);
break;
+ case T_RollupData:
+ retval = _copyRollupData(from);
+ break;
+ case T_GroupingSetData:
+ retval = _copyGroupingSetData(from);
+ break;
case T_GroupingSet:
retval = _copyGroupingSet(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 88b912977e..6aa71d3723 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3078,6 +3078,9 @@ equal(const void *a, const void *b)
case T_GroupingFunc:
retval = _equalGroupingFunc(a, b);
break;
+ case T_GroupingSetId:
+ retval = true;
+ break;
case T_WindowFunc:
retval = _equalWindowFunc(a, b);
break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index d85ca9f7c5..877ea0bc16 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -62,6 +62,9 @@ exprType(const Node *expr)
case T_GroupingFunc:
type = INT4OID;
break;
+ case T_GroupingSetId:
+ type = INT4OID;
+ break;
case T_WindowFunc:
type = ((const WindowFunc *) expr)->wintype;
break;
@@ -740,6 +743,9 @@ exprCollation(const Node *expr)
case T_GroupingFunc:
coll = InvalidOid;
break;
+ case T_GroupingSetId:
+ coll = InvalidOid;
+ break;
case T_WindowFunc:
coll = ((const WindowFunc *) expr)->wincollid;
break;
@@ -1869,6 +1875,7 @@ expression_tree_walker(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
/* primitive node types with no expression subnodes */
break;
case T_WithCheckOption:
@@ -2575,6 +2582,7 @@ expression_tree_mutator(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
return (Node *) copyObject(node);
case T_WithCheckOption:
{
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5816d122c1..efcb1c7d4f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -785,8 +785,9 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_LONG_FIELD(numGroups);
WRITE_UINT64_FIELD(transitionSpace);
WRITE_BITMAPSET_FIELD(aggParams);
- WRITE_NODE_FIELD(groupingSets);
+ WRITE_NODE_FIELD(rollup);
WRITE_NODE_FIELD(chain);
+ WRITE_NODE_FIELD(gsetid);
WRITE_NODE_FIELD(sortnode);
}
@@ -1150,6 +1151,13 @@ _outGroupingFunc(StringInfo str, const GroupingFunc *node)
WRITE_LOCATION_FIELD(location);
}
+static void
+_outGroupingSetId(StringInfo str,
+ const GroupingSetId *node __attribute__((unused)))
+{
+ WRITE_NODE_TYPE("GROUPINGSETID");
+}
+
static void
_outWindowFunc(StringInfo str, const WindowFunc *node)
{
@@ -2002,6 +2010,7 @@ _outGroupingSetData(StringInfo str, const GroupingSetData *node)
WRITE_NODE_TYPE("GSDATA");
WRITE_NODE_FIELD(set);
+ WRITE_INT_FIELD(setId);
WRITE_FLOAT_FIELD(numGroups, "%.0f");
}
@@ -3847,6 +3856,9 @@ outNode(StringInfo str, const void *obj)
case T_GroupingFunc:
_outGroupingFunc(str, obj);
break;
+ case T_GroupingSetId:
+ _outGroupingSetId(str, obj);
+ break;
case T_WindowFunc:
_outWindowFunc(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index af4fcfe1ee..c9a3340f58 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -636,6 +636,50 @@ _readGroupingFunc(void)
READ_DONE();
}
+/*
+ * _readGroupingSetId
+ */
+static GroupingSetId *
+_readGroupingSetId(void)
+{
+ READ_LOCALS_NO_FIELDS(GroupingSetId);
+
+ READ_DONE();
+}
+
+/*
+ * _readRollupData
+ */
+static RollupData *
+_readRollupData(void)
+{
+ READ_LOCALS(RollupData);
+
+ READ_NODE_FIELD(groupClause);
+ READ_NODE_FIELD(gsets);
+ READ_NODE_FIELD(gsets_data);
+ READ_FLOAT_FIELD(numGroups);
+ READ_BOOL_FIELD(hashable);
+ READ_BOOL_FIELD(is_hashed);
+
+ READ_DONE();
+}
+
+/*
+ * _readGroupingSetData
+ */
+static GroupingSetData *
+_readGroupingSetData(void)
+{
+ READ_LOCALS(GroupingSetData);
+
+ READ_NODE_FIELD(set);
+ READ_INT_FIELD(setId);
+ READ_FLOAT_FIELD(numGroups);
+
+ READ_DONE();
+}
+
/*
* _readWindowFunc
*/
@@ -2205,8 +2249,9 @@ _readAgg(void)
READ_LONG_FIELD(numGroups);
READ_UINT64_FIELD(transitionSpace);
READ_BITMAPSET_FIELD(aggParams);
- READ_NODE_FIELD(groupingSets);
+ READ_NODE_FIELD(rollup);
READ_NODE_FIELD(chain);
+ READ_NODE_FIELD(gsetid);
READ_NODE_FIELD(sortnode);
READ_DONE();
@@ -2642,6 +2687,12 @@ parseNodeString(void)
return_value = _readAggref();
else if (MATCH("GROUPINGFUNC", 12))
return_value = _readGroupingFunc();
+ else if (MATCH("GROUPINGSETID", 13))
+ return_value = _readGroupingSetId();
+ else if (MATCH("ROLLUP", 6))
+ return_value = _readRollupData();
+ else if (MATCH("GSDATA", 6))
+ return_value = _readGroupingSetData();
else if (MATCH("WINDOWFUNC", 10))
return_value = _readWindowFunc();
else if (MATCH("SUBSCRIPTINGREF", 15))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..e6c7f080e0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2710,8 +2710,11 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
/*
* For each useful ordering, we can consider an order-preserving Gather
- * Merge.
+ * Merge. Don't do this for partial groupingsets.
*/
+ if (root->parse->groupingSets)
+ return;
+
foreach(lc, rel->partial_pathlist)
{
Path *subpath = (Path *) lfirst(lc);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 29f88bf0b7..64205893a3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1641,7 +1641,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
groupColIdx,
groupOperators,
groupCollations,
- NIL,
+ NULL,
NIL,
best_path->path.rows,
0,
@@ -2095,7 +2095,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
extract_grouping_ops(best_path->groupClause),
extract_grouping_collations(best_path->groupClause,
subplan->targetlist),
- NIL,
+ NULL,
NIL,
best_path->numGroups,
best_path->transitionSpace,
@@ -2214,7 +2214,6 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
* never be grouping in an UPDATE/DELETE; but let's Assert that.
*/
Assert(root->inhTargetKind == INHKIND_NONE);
- Assert(root->grouping_map == NULL);
root->grouping_map = grouping_map;
/*
@@ -2237,10 +2236,13 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
/*
+ * In the final stage, a rollup may contain an empty set here.
+ *
* FIXME This combination of nested if checks needs some explanation
* why we need this particular combination of flags.
*/
- if (!rollup->is_hashed)
+ if (!rollup->is_hashed &&
+ list_length(linitial(rollup->gsets)) != 0)
{
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
@@ -2264,12 +2266,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
NIL,
rollup->numGroups,
best_path->transitionSpace,
@@ -2281,8 +2283,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
}
/*
- * Now make the real Agg node
- */
+ * Now make the real Agg node */
{
RollupData *rollup = linitial(rollups);
AttrNumber *top_grpColIdx;
@@ -2314,12 +2315,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
chain,
rollup->numGroups,
best_path->transitionSpace,
@@ -6221,7 +6222,7 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ RollupData *rollup, List *chain, double dNumGroups,
Size transitionSpace, Plan *sortnode, Plan *lefttree)
{
Agg *node = makeNode(Agg);
@@ -6240,8 +6241,9 @@ make_agg(List *tlist, List *qual,
node->numGroups = numGroups;
node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
- node->groupingSets = groupingSets;
+ node->rollup = rollup;
node->chain = chain;
+ node->gsetid = NULL;
node->sortnode = sortnode;
plan->qual = qual;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 68d9c88a53..cedd3e1c9d 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -113,6 +113,7 @@ typedef struct
Bitmapset *unhashable_refs;
List *unsortable_sets;
int *tleref_to_colnum_map;
+ int num_sets;
} grouping_sets_data;
/*
@@ -126,6 +127,8 @@ typedef struct
* clauses per Window */
} WindowClauseSortData;
+typedef void (*AddPathCallback) (RelOptInfo *parent_rel, Path *new_path);
+
/* Local functions */
static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
@@ -142,7 +145,8 @@ static double preprocess_limit(PlannerInfo *root,
static void remove_useless_groupby_columns(PlannerInfo *root);
static List *preprocess_groupclause(PlannerInfo *root, List *force);
static List *extract_rollup_sets(List *groupingSets);
-static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
+static List *reorder_grouping_sets(grouping_sets_data *gd,
+ List *groupingSets, List *sortclause);
static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
@@ -176,7 +180,11 @@ static void consider_groupingsets_paths(PlannerInfo *root,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
double dNumGroups,
- AggStrategy strat);
+ List *havingQual,
+ AggStrategy strat,
+ AggSplit aggsplit,
+ AddPathCallback add_path_fn);
+
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -250,6 +258,9 @@ static bool group_by_has_partkey(RelOptInfo *input_rel,
List *groupClause);
static int common_prefix_cmp(const void *a, const void *b);
+static List *extract_final_rollups(PlannerInfo *root,
+ grouping_sets_data *gd,
+ List *rollups);
/*****************************************************************************
*
@@ -2494,6 +2505,7 @@ preprocess_grouping_sets(PlannerInfo *root)
GroupingSetData *gs = makeNode(GroupingSetData);
gs->set = gset;
+ gs->setId = gd->num_sets++;
gd->unsortable_sets = lappend(gd->unsortable_sets, gs);
/*
@@ -2538,7 +2550,7 @@ preprocess_grouping_sets(PlannerInfo *root)
* largest-member-first, and applies the GroupingSetData annotations,
* though the data will be filled in later.
*/
- current_sets = reorder_grouping_sets(current_sets,
+ current_sets = reorder_grouping_sets(gd, current_sets,
(list_length(sets) == 1
? parse->sortClause
: NIL));
@@ -3547,7 +3559,7 @@ extract_rollup_sets(List *groupingSets)
* gets implemented in one pass.)
*/
static List *
-reorder_grouping_sets(List *groupingsets, List *sortclause)
+reorder_grouping_sets(grouping_sets_data *gd, List *groupingsets, List *sortclause)
{
ListCell *lc;
List *previous = NIL;
@@ -3581,6 +3593,7 @@ reorder_grouping_sets(List *groupingsets, List *sortclause)
previous = list_concat(previous, new_elems);
gs->set = list_copy(previous);
+ gs->setId = gd->num_sets++;
result = lcons(gs, result);
}
@@ -4191,8 +4204,14 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
* The caller specifies the preferred aggregate strategy (sorted or hashed) using
* the strat parameter. When the requested strategy is AGG_SORTED, the input path
* needs to be sorted accordingly (is_sorted needs to be true).
+ *
+ * The caller also needs to specify a callback used to add the path to the
+ * appropriate list - we can't simply use add_path, because with partial
+ * aggregation (PARTITIONWISE_AGGREGATE_PARTIAL) the path may need to be
+ * added to grouped_rel->pathlist, and the aggsplit value alone is not
+ * sufficient to make that decision.
*/
-static void
+static void
consider_groupingsets_paths(PlannerInfo *root,
RelOptInfo *grouped_rel,
Path *path,
@@ -4201,9 +4220,11 @@ consider_groupingsets_paths(PlannerInfo *root,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
double dNumGroups,
- AggStrategy strat)
+ List *havingQual,
+ AggStrategy strat,
+ AggSplit aggsplit,
+ AddPathCallback add_path_fn)
{
- Query *parse = root->parse;
Assert(strat == AGG_HASHED || strat == AGG_SORTED);
/*
@@ -4367,16 +4388,20 @@ consider_groupingsets_paths(PlannerInfo *root,
strat = AGG_MIXED;
}
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- strat,
- new_rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ new_rollups = extract_final_rollups(root, gd, new_rollups);
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ strat,
+ new_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
return;
}
@@ -4388,7 +4413,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* Callers consider AGG_SORTED strategy, the first rollup must
- * use non-hashed aggregate, 'is_sorted' tells whether the first
+ * use non-hashed aggregate, is_sorted tells whether the first
* rollup needs to do its own sort.
*
* we try and make two paths: one sorted and one mixed
@@ -4533,16 +4558,20 @@ consider_groupingsets_paths(PlannerInfo *root,
if (rollups)
{
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_MIXED,
- rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rollups = extract_final_rollups(root, gd, rollups);
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_MIXED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
}
}
@@ -4550,16 +4579,82 @@ consider_groupingsets_paths(PlannerInfo *root,
* Now try the simple sorted case.
*/
if (!gd->unsortable_sets)
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_SORTED,
- gd->rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ {
+ List *rollups;
+
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rollups = extract_final_rollups(root, gd, gd->rollups);
+ else
+ rollups = gd->rollups;
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_SORTED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
+ }
+}
+
+/*
+ * If we are combining partial grouping sets aggregation, the input
+ * contains tuples from different grouping sets mixed together, and the
+ * executor dispatches each tuple to the proper rollup (phase) according
+ * to its grouping set id.
+ *
+ * We cannot reuse the rollups from the initial stage, in which each
+ * tuple is processed by one or more grouping sets within one rollup,
+ * because in the combining stage each tuple belongs to exactly one
+ * grouping set. Instead, we use final rollups, in which each rollup has
+ * only one grouping set.
+ */
+static List *
+extract_final_rollups(PlannerInfo *root,
+ grouping_sets_data *gd,
+ List *rollups)
+{
+ ListCell *lc;
+ List *new_rollups = NIL;
+
+ foreach(lc, rollups)
+ {
+ ListCell *lc1;
+ RollupData *rollup = lfirst_node(RollupData, lc);
+
+ foreach(lc1, rollup->gsets_data)
+ {
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc1);
+ RollupData *new_rollup = makeNode(RollupData);
+
+ if (gs->set != NIL)
+ {
+ new_rollup->groupClause = preprocess_groupclause(root, gs->set);
+ new_rollup->gsets_data = list_make1(gs);
+ new_rollup->gsets = remap_to_groupclause_idx(new_rollup->groupClause,
+ new_rollup->gsets_data,
+ gd->tleref_to_colnum_map);
+ new_rollup->hashable = rollup->hashable;
+ new_rollup->is_hashed = rollup->is_hashed;
+ }
+ else
+ {
+ new_rollup->groupClause = NIL;
+ new_rollup->gsets_data = list_make1(gs);
+ new_rollup->gsets = list_make1(NIL);
+ new_rollup->hashable = false;
+ new_rollup->is_hashed = false;
+ }
+
+ new_rollup->numGroups = gs->numGroups;
+ new_rollups = lappend(new_rollups, new_rollup);
+ }
+ }
+
+ return new_rollups;
}
/*
@@ -5269,6 +5364,17 @@ make_partial_grouping_target(PlannerInfo *root,
add_new_columns_to_pathtarget(partial_target, non_group_exprs);
+ /*
+ * We are generating a partial grouping sets path; add an expression that
+ * computes the grouping set ID for each tuple, so that in the final stage
+ * the executor knows which set a tuple belongs to and can combine the
+ * partial results correctly.
+ */
+ if (parse->groupingSets)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ add_new_column_to_pathtarget(partial_target, (Expr *)expr);
+ }
+
/*
* Adjust Aggrefs to put them in partial mode. At this point all Aggrefs
* are at the top level of the target list, so we can just scan the list
@@ -6433,7 +6539,10 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
consider_groupingsets_paths(root, grouped_rel,
path, is_sorted, can_hash,
gd, agg_costs, dNumGroups,
- AGG_SORTED);
+ havingQual,
+ AGG_SORTED,
+ AGGSPLIT_SIMPLE,
+ add_path);
continue;
}
@@ -6494,15 +6603,37 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
foreach(lc, partially_grouped_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
+ bool is_sorted;
+
+ is_sorted = pathkeys_contained_in(root->group_pathkeys,
+ path->pathkeys);
+
+ /*
+ * Use any available suitably-sorted path as input, and also
+ * consider sorting the cheapest-total path.
+ */
+ if (path != partially_grouped_rel->cheapest_total_path &&
+ !is_sorted)
+ continue;
+
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGG_SORTED,
+ AGGSPLIT_FINAL_DESERIAL,
+ add_path);
+ continue;
+ }
/*
* Insert a Sort node, if required. But there's no point in
* sorting anything but the cheapest path.
*/
- if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+ if (!is_sorted)
{
- if (path != partially_grouped_rel->cheapest_total_path)
- continue;
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -6546,7 +6677,10 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
gd, agg_costs, dNumGroups,
- AGG_HASHED);
+ havingQual,
+ AGG_HASHED,
+ AGGSPLIT_SIMPLE,
+ add_path);
}
else
{
@@ -6589,22 +6723,39 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = partially_grouped_rel->cheapest_total_path;
- hashaggtablesize = estimate_hashagg_tablesize(path,
- agg_final_costs,
- dNumGroups);
+ if (parse->groupingSets)
+ {
+ /*
+ * Try for a hash-only groupingsets path over unsorted input.
+ */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ add_path);
+ }
+ else
+ {
- if (hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ hashaggtablesize = estimate_hashagg_tablesize(path,
+ agg_final_costs,
+ dNumGroups);
+
+ if (hashaggtablesize < work_mem * 1024L)
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6814,6 +6965,19 @@ create_partial_grouping_paths(PlannerInfo *root,
path->pathkeys);
if (path == cheapest_partial_path || is_sorted)
{
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGG_SORTED,
+ AGGSPLIT_INITIAL_SERIAL,
+ add_partial_path);
+ continue;
+ }
+
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
path = (Path *) create_sort_path(root,
@@ -6821,7 +6985,7 @@ create_partial_grouping_paths(PlannerInfo *root,
path,
root->group_pathkeys,
-1.0);
-
+
if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
create_agg_path(root,
@@ -6883,26 +7047,41 @@ create_partial_grouping_paths(PlannerInfo *root,
{
double hashaggtablesize;
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_partial_path,
- agg_partial_costs,
- dNumPartialPartialGroups);
-
- /* Do the same for partial paths. */
- if (hashaggtablesize < work_mem * 1024L &&
- cheapest_partial_path != NULL)
+ if (parse->groupingSets)
{
- add_partial_path(partially_grouped_rel, (Path *)
- create_agg_path(root,
- partially_grouped_rel,
- cheapest_partial_path,
- partially_grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_INITIAL_SERIAL,
- parse->groupClause,
- NIL,
- agg_partial_costs,
- dNumPartialPartialGroups));
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ cheapest_partial_path,
+ false, true,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ add_partial_path);
+ }
+ else
+ {
+ hashaggtablesize =
+ estimate_hashagg_tablesize(cheapest_partial_path,
+ agg_partial_costs,
+ dNumPartialPartialGroups);
+
+ /* Do the same for partial paths. */
+ if (hashaggtablesize < work_mem * 1024L &&
+ cheapest_partial_path != NULL)
+ {
+ add_partial_path(partially_grouped_rel, (Path *)
+ create_agg_path(root,
+ partially_grouped_rel,
+ cheapest_partial_path,
+ partially_grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ parse->groupClause,
+ NIL,
+ agg_partial_costs,
+ dNumPartialPartialGroups));
+ }
}
}
@@ -6946,6 +7125,9 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
generate_gather_paths(root, rel, true);
/* Try cheapest partial path + explicit Sort + Gather Merge. */
+ if (root->parse->groupingSets)
+ return;
+
cheapest_partial_path = linitial(rel->partial_pathlist);
if (!pathkeys_contained_in(root->group_pathkeys,
cheapest_partial_path->pathkeys))
@@ -6990,11 +7172,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..eae7d15701 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -754,6 +754,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
plan->qual = (List *)
convert_combining_aggrefs((Node *) plan->qual,
NULL);
+
+ /*
+ * If this is a grouping sets aggregate, we must add an expression to
+ * evaluate the grouping set ID and fix up its reference against the
+ * targetlist of the child plan node.
+ */
+ if (agg->rollup)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ indexed_tlist *subplan_itlist = build_tlist_index(plan->lefttree->targetlist);
+
+ agg->gsetid = (Expr *) fix_upper_expr(root, (Node *)expr,
+ subplan_itlist,
+ OUTER_VAR,
+ rtoffset);
+ }
}
set_upper_references(root, plan, rtoffset);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index ff8f676dfb..9fe6f6a003 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2994,6 +2994,7 @@ create_groupingsets_path(PlannerInfo *root,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups,
+ AggSplit aggsplit,
bool is_sorted)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
@@ -3011,6 +3012,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->aggsplit = aggsplit;
pathnode->is_sorted = is_sorted;
/*
@@ -3045,11 +3047,27 @@ create_groupingsets_path(PlannerInfo *root,
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
Assert(aggstrategy != AGG_MIXED || list_length(rollups) > 1);
+ /*
+ * Estimate the cost of groupingsets.
+ *
+ * If we are finalizing grouping sets, subpath->rows contains
+ * rows from all sets, so we must estimate the number of input
+ * rows for each rollup. The cost of preprocessing the grouping
+ * sets is not charged here: the expression used to redirect
+ * tuples is a simple Var, which normally costs nothing.
+ */
foreach(lc, rollups)
{
RollupData *rollup = lfirst(lc);
List *gsets = rollup->gsets;
int numGroupCols = list_length(linitial(gsets));
+ double rows;
+
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rows = rollup->numGroups * subpath->rows / numGroups;
+ else
+ rows = subpath->rows;
/*
* In AGG_SORTED or AGG_PLAIN mode, the first rollup do its own
@@ -3071,7 +3089,7 @@ create_groupingsets_path(PlannerInfo *root,
cost_sort(&sort_path, root, NIL,
input_total_cost,
- subpath->rows,
+ rows,
subpath->pathtarget->width,
0.0,
work_mem,
@@ -3089,7 +3107,7 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
input_startup_cost,
input_total_cost,
- subpath->rows);
+ rows);
is_first = false;
}
@@ -3101,7 +3119,6 @@ create_groupingsets_path(PlannerInfo *root,
sort_path.startup_cost = 0;
sort_path.total_cost = 0;
- sort_path.rows = subpath->rows;
rollup_strategy = rollup->is_hashed ?
AGG_HASHED : (numGroupCols ? AGG_SORTED : AGG_PLAIN);
@@ -3111,7 +3128,7 @@ create_groupingsets_path(PlannerInfo *root,
/* Account for cost of sort, but don't charge input cost again */
cost_sort(&sort_path, root, NIL,
0.0,
- subpath->rows,
+ rows,
subpath->pathtarget->width,
0.0,
work_mem,
@@ -3127,7 +3144,7 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows);
+ rows);
pathnode->path.total_cost += agg_path.total_cost;
pathnode->path.rows += agg_path.rows;
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 5e63238f03..5779d158ba 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -7941,6 +7941,12 @@ get_rule_expr(Node *node, deparse_context *context,
}
break;
+ case T_GroupingSetId:
+ {
+ appendStringInfoString(buf, "GROUPINGSETID()");
+ }
+ break;
+
case T_WindowFunc:
get_windowfunc_expr((WindowFunc *) node, context);
break;
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 4ed5d0a7de..4d36c2d77b 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -216,6 +216,7 @@ typedef enum ExprEvalOp
EEOP_XMLEXPR,
EEOP_AGGREF,
EEOP_GROUPING_FUNC,
+ EEOP_GROUPING_SET_ID,
EEOP_WINDOW_FUNC,
EEOP_SUBPLAN,
EEOP_ALTERNATIVE_SUBPLAN,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index c5d4121c37..967af08af7 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -281,6 +281,8 @@ typedef struct AggStatePerPhaseData
List *concurrent_hashes; /* hash phases can do transition concurrently */
AggStatePerGroup *pergroups; /* pergroup states for a phase */
bool skip_build_trans;
+#define FIELDNO_AGGSTATEPERPHASE_SETNOGSETIDS 10
+ int *setno_gsetids; /* setno <-> gsetid map */
} AggStatePerPhaseData;
typedef struct AggStatePerPhaseSortData
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 4081a0978e..dea5b10597 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2047,6 +2047,7 @@ typedef struct AggState
int numtrans; /* number of pertrans items */
AggStrategy aggstrategy; /* strategy mode */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
+#define FIELDNO_AGGSTATE_PHASE 6
AggStatePerPhase phase; /* pointer to current phase data */
int numphases; /* number of phases (including phase 0) */
int current_phase; /* current phase number */
@@ -2070,8 +2071,6 @@ typedef struct AggState
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
AggStatePerPhase *phases; /* array of all phases */
- Tuplesortstate *sort_in; /* sorted input to phases > 1 */
- Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
@@ -2083,6 +2082,11 @@ typedef struct AggState
int eflags; /* eflags for the first sort */
ProjectionInfo *combinedproj; /* projection machinery */
+
+ /* these fields are used in parallel grouping sets */
+ bool groupingsets_preprocess; /* groupingsets preprocessed yet? */
+ ExprState *gsetid; /* expression state to get grpsetid from input */
+ int *gsetid_phaseidxs; /* grpsetid <-> phaseidx mapping */
} AggState;
/* ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..a48a7af0e3 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -153,6 +153,7 @@ typedef enum NodeTag
T_Param,
T_Aggref,
T_GroupingFunc,
+ T_GroupingSetId,
T_WindowFunc,
T_SubscriptingRef,
T_FuncExpr,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c1e69c808f..2761fa6d01 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1676,6 +1676,7 @@ typedef struct GroupingSetData
{
NodeTag type;
List *set; /* grouping set as list of sortgrouprefs */
+ int setId; /* unique grouping set identifier */
double numGroups; /* est. number of result groups */
} GroupingSetData;
@@ -1702,6 +1703,7 @@ typedef struct GroupingSetsPath
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
uint64 transitionSpace; /* for pass-by-ref transition data */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
bool is_sorted; /* input sorted in groupcols of first rollup */
} GroupingSetsPath;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 3cd2537e9e..5b1239adf2 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -20,6 +20,7 @@
#include "nodes/bitmapset.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
+#include "nodes/pathnodes.h"
/* ----------------------------------------------------------------
@@ -816,8 +817,9 @@ typedef struct Agg
uint64 transitionSpace; /* for pass-by-ref transition data */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
- List *groupingSets; /* grouping sets to use */
+ RollupData *rollup; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Expr *gsetid; /* expression to fetch grouping set id */
Plan *sortnode; /* agg does its own sort, only used by grouping sets now */
} Agg;
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index d73be2ad46..f8f85d431a 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -364,6 +364,12 @@ typedef struct GroupingFunc
int location; /* token location */
} GroupingFunc;
+/* GroupingSetId */
+typedef struct GroupingSetId
+{
+ Expr xpr;
+} GroupingSetId;
+
/*
* WindowFunc
*/
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index f9f388ba06..4fde8b22bf 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -218,6 +218,7 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups,
+ AggSplit aggsplit,
bool is_sorted);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5954ff3997..e987011328 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,7 +54,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ RollupData *rollup, List *chain, double dNumGroups,
Size transitionSpace, Plan *sortnode, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
--
2.21.1
Attachment: 0007-fix.patch (text/plain; charset=us-ascii)
From b87116ee21a59501ef9d9decfebf1cf5aa48d734 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 03:02:42 +0100
Subject: [PATCH 7/7] fix
---
src/backend/executor/nodeAgg.c | 20 ++++++++++----------
src/backend/jit/llvm/llvmjit_expr.c | 2 +-
src/backend/optimizer/plan/planner.c | 12 ++++++------
3 files changed, 17 insertions(+), 17 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index f7b98dd798..51c7f229e2 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -360,7 +360,7 @@ initialize_phase(AggState *aggstate, int newphase)
if (persort->store_in)
{
tuplestore_end(persort->store_in);
- persort->store_in = NULL;
+ persort->store_in = NULL;
}
}
@@ -2017,7 +2017,7 @@ agg_preprocess_groupingsets(AggState *aggstate)
/* Initialize tuples storage for each aggregate phases */
for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- phase = aggstate->phases[phaseidx];
+ phase = aggstate->phases[phaseidx];
if (!phase->is_hashed)
{
@@ -2039,12 +2039,12 @@ agg_preprocess_groupingsets(AggState *aggstate)
}
else
{
- persort->store_in = tuplestore_begin_heap(false, false, work_mem);
+ persort->store_in = tuplestore_begin_heap(false, false, work_mem);
}
}
else
{
- /*
+ /*
* If it's AGG_HASHED, we don't need storage to keep the
* tuples for later processing; we can do the transition
* immediately.
@@ -2422,7 +2422,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->maxsets = numGroupingSets;
aggstate->numphases = 1 + list_length(node->chain);
- /*
+ /*
* We are doing the final stage of partial grouping sets; preprocess the
* input tuples first, redirecting them to the corresponding aggregate
* phases. See agg_preprocess_groupingsets().
@@ -2431,7 +2431,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
{
aggstate->groupingsets_preprocess = true;
- /*
+ /*
* Allocate gsetid <-> phases mapping, in final stage of
* partial groupingsets, all grouping sets are extracted
* to individual phases, so the number of sets is equal
@@ -2449,7 +2449,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* agg_sort_input(), this can only happen in groupingsets case.
*/
if (node->sortnode)
- aggstate->input_sorted = false;
+ aggstate->input_sorted = false;
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
@@ -2626,7 +2626,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->setno_gsetids[0] = gs->setId;
}
- /*
+ /*
* Initialize pergroup state. For AGG_HASHED, all groups do transition
* on the fly, all pergroup states are kept in hashtable, everytime
* a tuple is processed, lookup_hash_entry() choose one group and
@@ -2688,7 +2688,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->grouped_cols[i] = cols;
phasedata->gset_lengths[i] = current_length;
- /*
+ /*
* In the initial stage of partial grouping sets, it provides extra
* grouping sets ID in the targetlist, fill the setno <-> gsetid
* map, so EEOP_GROUPING_SET_ID can evaluate correct gsetid for
@@ -3871,7 +3871,7 @@ ExecReScanAgg(AggState *node)
}
}
- /*
+ /*
* If the agg is doing the final stage of partial grouping sets, reset the
* flag so the grouping sets preprocessing runs again.
*/
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index f442442269..f70eaabd0c 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -1893,7 +1893,7 @@ llvm_compile_expr(ExprState *state)
v_aggstatep =
LLVMBuildBitCast(b, v_parent, l_ptr(StructAggState), "");
- /*
+ /*
* op->resvalue =
* aggstate->phase->setno_gsetids
* [aggstate->current_set]
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index cedd3e1c9d..a0186091a1 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4211,7 +4211,7 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
* added to grouped_rel->pathlist. And aggsplit value is not sufficient to
* make a decision.
*/
-static void
+static void
consider_groupingsets_paths(PlannerInfo *root,
RelOptInfo *grouped_rel,
Path *path,
@@ -4601,7 +4601,7 @@ consider_groupingsets_paths(PlannerInfo *root,
}
}
-/*
+/*
* If we are combining the partial groupingsets aggregation, the input is
* mixed with tuples from different grouping sets, executor dispatch the
* tuples to different rollups (phases) according to the grouping set id.
@@ -4644,7 +4644,7 @@ extract_final_rollups(PlannerInfo *root,
{
new_rollup->groupClause = NIL;
new_rollup->gsets_data = list_make1(gs);
- new_rollup->gsets = list_make1(NIL);
+ new_rollup->gsets = list_make1(NIL);
new_rollup->hashable = false;
new_rollup->is_hashed = false;
}
@@ -5364,7 +5364,7 @@ make_partial_grouping_target(PlannerInfo *root,
add_new_columns_to_pathtarget(partial_target, non_group_exprs);
- /*
+ /*
* We are generate partial groupingsets path, add an expression to show
* the grouping set ID for a tuple, so in the final stage, executor can
* know which set this tuple belongs to and combine them.
@@ -6985,7 +6985,7 @@ create_partial_grouping_paths(PlannerInfo *root,
path,
root->group_pathkeys,
-1.0);
-
+
if (parse->hasAggs)
add_partial_path(partially_grouped_rel, (Path *)
create_agg_path(root,
@@ -7059,7 +7059,7 @@ create_partial_grouping_paths(PlannerInfo *root,
AGGSPLIT_INITIAL_SERIAL,
add_partial_path);
}
- else
+ else
{
hashaggtablesize =
estimate_hashagg_tablesize(cheapest_partial_path,
--
2.21.1
Thank you for reviewing this patch.
On Thu, Mar 19, 2020 at 10:09 AM Tomas Vondra <tomas.vondra@2ndquadrant.com>
wrote:
Hi,
unfortunately this got a bit broken by the disk-based hash aggregation,
committed today, and so it needs a rebase. I've started looking at the
patch before that, and I have it rebased on e00912e11a9e (i.e. the
commit before the one that breaks it).
I spent the day looking into the details of the hash spill patch and finally
managed to rebase it. I tested the first 5 patches and they all passed
installcheck; the 0006-parallel-xxx.patch is not tested yet, and I also need to
make hash spill work in the final stage of parallel grouping sets, which I will
do tomorrow.
The conflicts were mainly located in the handling of hash spill for grouping
sets. The 0004-reorganise-xxxx patch also makes the hash-table refill stage
easier and avoids the null check in that stage.
Attached is the rebased patch series (now broken), with a couple of
commits with some minor cosmetic changes I propose to make (easier than
explaining it on a list, it's mostly about whitespace, comments etc).
Feel free to reject the changes, it's up to you.
Thanks, I will enhance the comments and take care of the whitespace.
I'll continue doing the review, but it'd be good to have a fully rebased
version.
I very much appreciate it.
Thanks,
Pengzhou
Hi Tomas,
I rebased the code and resolved the comments you attached; the remaining
unresolved comments are explained in 0002-fixes.patch, please take a look.
I also made hash spill work for parallel grouping sets; the plan
looks like:
gpadmin=# explain select g100, g10, sum(g::numeric), count(*), max(g::text)
from gstest_p group by cube (g100,g10);
QUERY PLAN
-------------------------------------------------------------------------------------------
Finalize MixedAggregate (cost=1000.00..7639.95 rows=1111 width=80)
Filtered by: (GROUPINGSETID())
Group Key: ()
Hash Key: g100, g10
Hash Key: g100
Hash Key: g10
Planned Partitions: 4
-> Gather (cost=1000.00..6554.34 rows=7777 width=84)
Workers Planned: 7
-> Partial MixedAggregate (cost=0.00..4776.64 rows=1111 width=84)
Group Key: ()
Hash Key: g100, g10
Hash Key: g100
Hash Key: g10
Planned Partitions: 4
-> Parallel Seq Scan on gstest_p (cost=0.00..1367.71
rows=28571 width=12)
(16 rows)
Thanks,
Pengzhou
Attachments:
0001-All-grouping-sets-do-their-own-sorting.patch (application/octet-stream)
From 8a6add3e2246e2019be647e436f05dc9abcf271e Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:07:29 -0400
Subject: [PATCH 1/5] All grouping sets do their own sorting
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
PostgreSQL used to add a Sort path explicitly beneath the Agg for a sorted
aggregate; a grouping sets path likewise added a Sort path for the first sorted
aggregate phase, while the following sorted aggregate phases did their own
sorting using a tuplesort.
This commit unifies the way a grouping sets path sorts: all sorted aggregate
phases now do their own sorting using a tuplesort.
This commit is mainly a preparatory step toward supporting parallel grouping
sets. The main idea is that, like parallel aggregate, we separate grouping sets
into two stages:
The initial stage: this stage has almost the same plan and execution routines
as the current implementation of grouping sets; the differences are that 1) it
only produces partial aggregate results, and 2) each output tuple carries an
extra grouping set id. The partial aggregate results will be combined in the
final stage, and since we have multiple grouping sets, only partial aggregate
results belonging to the same grouping set can be combined; that is why the
grouping set id is introduced to identify the sets. We keep all the
optimizations for multiple grouping sets in the initial stage, e.g. 1) grouping
sets that can be grouped by one single sort are put into one rollup structure,
so those sets are computed in one aggregate phase; 2) hash aggregation is done
concurrently while a sorted aggregate is performed; and 3) all hash transitions
are done in one expression state.
The final stage: this stage combines the partial aggregate results according to
the grouping set id. Obviously, none of the optimizations from the initial
stage can be used, so the rollups are flattened out: each rollup contains only
one grouping set, and each aggregate phase processes only one set. The final
stage applies a filter that redirects the tuples to the proper aggregate phase.
Obviously, adding a Sort path underneath the Agg in the final stage is not
right. This commit avoids that: all non-hashed aggregate phases can do their
own sorting after the tuples are redirected.
---
src/backend/commands/explain.c | 5 +-
src/backend/executor/nodeAgg.c | 79 +++++++++++---
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 65 ++++++++----
src/backend/optimizer/plan/planner.c | 66 ++++++++----
src/backend/optimizer/util/pathnode.c | 30 +++++-
src/include/executor/nodeAgg.h | 2 -
src/include/nodes/execnodes.h | 5 +-
src/include/nodes/pathnodes.h | 1 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/pathnode.h | 3 +-
src/include/optimizer/planmain.h | 2 +-
src/test/regress/expected/groupingsets.out | 161 ++++++++++++++---------------
15 files changed, 275 insertions(+), 148 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 58141d8393..8c82d6ea95 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2291,15 +2291,14 @@ show_grouping_sets(PlanState *planstate, Agg *agg,
ExplainOpenGroup("Grouping Sets", "Grouping Sets", false, es);
- show_grouping_set_keys(planstate, agg, NULL,
+ show_grouping_set_keys(planstate, agg, (Sort *) agg->sortnode,
context, useprefix, ancestors, es);
foreach(lc, agg->chain)
{
Agg *aggnode = lfirst(lc);
- Sort *sortnode = (Sort *) aggnode->plan.lefttree;
- show_grouping_set_keys(planstate, aggnode, sortnode,
+ show_grouping_set_keys(planstate, aggnode, (Sort *) aggnode->sortnode,
context, useprefix, ancestors, es);
}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 44c159ab2a..bf484e19ec 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -403,6 +403,7 @@ static int hash_choose_num_partitions(uint64 input_groups,
static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
+static void agg_sort_input(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
@@ -515,7 +516,7 @@ initialize_phase(AggState *aggstate, int newphase)
*/
if (newphase > 0 && newphase < aggstate->numphases - 1)
{
- Sort *sortnode = aggstate->phases[newphase + 1].sortnode;
+ Sort *sortnode = (Sort *)aggstate->phases[newphase + 1].aggnode->sortnode;
PlanState *outerNode = outerPlanState(aggstate);
TupleDesc tupDesc = ExecGetResultType(outerNode);
@@ -2108,6 +2109,8 @@ ExecAgg(PlanState *pstate)
break;
case AGG_PLAIN:
case AGG_SORTED:
+ if (!node->input_sorted)
+ agg_sort_input(node);
result = agg_retrieve_direct(node);
break;
}
@@ -2465,6 +2468,45 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+static void
+agg_sort_input(AggState *aggstate)
+{
+ AggStatePerPhase phase = &aggstate->phases[1];
+ TupleDesc tupDesc;
+ Sort *sortnode;
+
+ Assert(!aggstate->input_sorted);
+ Assert(phase->aggnode->sortnode);
+
+ sortnode = (Sort *) phase->aggnode->sortnode;
+ tupDesc = ExecGetResultType(outerPlanState(aggstate));
+
+ aggstate->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ for (;;)
+ {
+ TupleTableSlot *outerslot;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
+ if (TupIsNull(outerslot))
+ break;
+
+ tuplesort_puttupleslot(aggstate->sort_in, outerslot);
+ }
+
+ /* Sort the first phase */
+ tuplesort_performsort(aggstate->sort_in);
+
+ /* Mark the input to be sorted */
+ aggstate->input_sorted = true;
+}
+
/*
* ExecAgg for hashed case: read input and build hash table
*/
@@ -3133,6 +3175,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
Plan *outerPlan;
ExprContext *econtext;
TupleDesc scanDesc;
+ Agg *firstSortAgg;
int numaggs,
transno,
aggno;
@@ -3177,6 +3220,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->grp_firstTuple = NULL;
aggstate->sort_in = NULL;
aggstate->sort_out = NULL;
+ aggstate->input_sorted = true;
/*
* phases[0] always exists, but is dummy in sorted/plain mode
@@ -3184,6 +3228,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numPhases = (use_hashing ? 1 : 2);
numHashes = (use_hashing ? 1 : 0);
+ firstSortAgg = node->aggstrategy == AGG_SORTED ? node : NULL;
+
/*
* Calculate the maximum number of grouping sets in any phase; this
* determines the size of some allocations. Also calculate the number of
@@ -3205,7 +3251,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* others add an extra phase.
*/
if (agg->aggstrategy != AGG_HASHED)
+ {
++numPhases;
+
+ if (!firstSortAgg)
+ firstSortAgg = agg;
+
+ }
else
++numHashes;
}
@@ -3214,6 +3266,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->maxsets = numGroupingSets;
aggstate->numphases = numPhases;
+ /*
+ * The first SORTED phase's input is not sorted, so the agg needs to do its
+ * own sort. See agg_sort_input(); this can only happen in the grouping sets case.
+ */
+ if (firstSortAgg && firstSortAgg->sortnode)
+ aggstate->input_sorted = false;
+
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
@@ -3275,7 +3334,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* If there are more than two phases (including a potential dummy phase
* 0), input will be resorted using tuplesort. Need a slot for that.
*/
- if (numPhases > 2)
+ if (numPhases > 2 ||
+ !aggstate->input_sorted)
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -3346,20 +3406,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
for (phaseidx = 0; phaseidx <= list_length(node->chain); ++phaseidx)
{
Agg *aggnode;
- Sort *sortnode;
if (phaseidx > 0)
- {
aggnode = list_nth_node(Agg, node->chain, phaseidx - 1);
- sortnode = castNode(Sort, aggnode->plan.lefttree);
- }
else
- {
aggnode = node;
- sortnode = NULL;
- }
-
- Assert(phase <= 1 || sortnode);
if (aggnode->aggstrategy == AGG_HASHED
|| aggnode->aggstrategy == AGG_MIXED)
@@ -3476,7 +3527,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->aggnode = aggnode;
phasedata->aggstrategy = aggnode->aggstrategy;
- phasedata->sortnode = sortnode;
}
}
@@ -4611,6 +4661,10 @@ ExecReScanAgg(AggState *node)
sizeof(AggStatePerGroupData) * node->numaggs);
}
+ /* Reset input_sorted */
+ if (aggnode->sortnode)
+ node->input_sorted = false;
+
/* reset to phase 1 */
initialize_phase(node, 1);
@@ -4618,6 +4672,7 @@ ExecReScanAgg(AggState *node)
node->projected_set = -1;
}
+
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f753..04b4c65858 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -992,6 +992,7 @@ _copyAgg(const Agg *from)
COPY_BITMAPSET_FIELD(aggParams);
COPY_NODE_FIELD(groupingSets);
COPY_NODE_FIELD(chain);
+ COPY_NODE_FIELD(sortnode);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f069..5816d122c1 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -787,6 +787,7 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_BITMAPSET_FIELD(aggParams);
WRITE_NODE_FIELD(groupingSets);
WRITE_NODE_FIELD(chain);
+ WRITE_NODE_FIELD(sortnode);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3479..af4fcfe1ee 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2207,6 +2207,7 @@ _readAgg(void)
READ_BITMAPSET_FIELD(aggParams);
READ_NODE_FIELD(groupingSets);
READ_NODE_FIELD(chain);
+ READ_NODE_FIELD(sortnode);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908dc6..d5b34089aa 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1645,6 +1645,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
best_path->path.rows,
0,
+ NULL,
subplan);
}
else
@@ -2098,6 +2099,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
best_path->numGroups,
best_path->transitionSpace,
+ NULL,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2159,6 +2161,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
List *rollups = best_path->rollups;
AttrNumber *grouping_map;
int maxref;
+ int flags = CP_LABEL_TLIST;
List *chain;
ListCell *lc;
@@ -2168,9 +2171,15 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
/*
* Agg can project, so no need to be terribly picky about child tlist, but
- * we do need grouping columns to be available
+ * we do need grouping columns to be available; If the groupingsets need
+ * to sort the input, the agg will store the input rows in a tuplesort,
+ * it therefore behooves us to request a small tlist to avoid wasting
+ * spaces.
*/
- subplan = create_plan_recurse(root, best_path->subpath, CP_LABEL_TLIST);
+ if (!best_path->is_sorted)
+ flags = flags | CP_SMALL_TLIST;
+
+ subplan = create_plan_recurse(root, best_path->subpath, flags);
/*
* Compute the mapping from tleSortGroupRef to column index in the child's
@@ -2230,12 +2239,22 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
- if (!rollup->is_hashed && !is_first_sort)
+ if (!rollup->is_hashed)
{
- sort_plan = (Plan *)
- make_sort_from_groupcols(rollup->groupClause,
- new_grpColIdx,
- subplan);
+ if (!is_first_sort ||
+ (is_first_sort && !best_path->is_sorted))
+ {
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ new_grpColIdx,
+ subplan);
+
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
}
if (!rollup->is_hashed)
@@ -2260,16 +2279,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
NIL,
rollup->numGroups,
best_path->transitionSpace,
- sort_plan);
-
- /*
- * Remove stuff we don't need to avoid bloating debug output.
- */
- if (sort_plan)
- {
- sort_plan->targetlist = NIL;
- sort_plan->lefttree = NULL;
- }
+ sort_plan,
+ NULL);
chain = lappend(chain, agg_plan);
}
@@ -2281,10 +2292,26 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
{
RollupData *rollup = linitial(rollups);
AttrNumber *top_grpColIdx;
+ Plan *sort_plan = NULL;
int numGroupCols;
top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
+ /* the input is not sorted yet */
+ if (!rollup->is_hashed &&
+ !best_path->is_sorted)
+ {
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ top_grpColIdx,
+ subplan);
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
+
numGroupCols = list_length((List *) linitial(rollup->gsets));
plan = make_agg(build_path_tlist(root, &best_path->path),
@@ -2299,6 +2326,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
chain,
rollup->numGroups,
best_path->transitionSpace,
+ sort_plan,
subplan);
/* Copy cost data from Path to Plan */
@@ -6197,7 +6225,7 @@ make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain, double dNumGroups,
- Size transitionSpace, Plan *lefttree)
+ Size transitionSpace, Plan *sortnode, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6217,6 +6245,7 @@ make_agg(List *tlist, List *qual,
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
+ node->sortnode = sortnode;
plan->qual = qual;
plan->targetlist = tlist;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index eb25c2f470..b7858e8d02 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -175,7 +175,8 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ AggStrategy strat);
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -4186,6 +4187,14 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
* it, by combinations of hashing and sorting. This can be called multiple
* times, so it's important that it not scribble on input. No result is
* returned, but any generated paths are added to grouped_rel.
+ *
+ * - strat:
+ * preferred aggregate strategy to use.
+ *
+ * - is_sorted:
+ * Is the input sorted on the groupCols of the first rollup. Caller
+ * must set it correctly if strat is set to AGG_SORTED, the planner
+ * uses it to generate a sortnode.
*/
static void
consider_groupingsets_paths(PlannerInfo *root,
@@ -4195,13 +4204,15 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ AggStrategy strat)
{
Query *parse = root->parse;
+ Assert(strat == AGG_HASHED || strat == AGG_SORTED);
/*
- * If we're not being offered sorted input, then only consider plans that
- * can be done entirely by hashing.
+ * If strat is AGG_HASHED, then only consider plans that can be done
+ * entirely by hashing.
*
* We can hash everything if it looks like it'll fit in work_mem. But if
* the input is actually sorted despite not being advertised as such, we
@@ -4210,7 +4221,7 @@ consider_groupingsets_paths(PlannerInfo *root,
* If none of the grouping sets are sortable, then ignore the work_mem
* limit and generate a path anyway, since otherwise we'll just fail.
*/
- if (!is_sorted)
+ if (strat == AGG_HASHED)
{
List *new_rollups = NIL;
RollupData *unhashed_rollup = NULL;
@@ -4251,6 +4262,8 @@ consider_groupingsets_paths(PlannerInfo *root,
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
l_start = lnext(gd->rollups, l_start);
+ /* update is_sorted to true */
+ is_sorted = true;
}
hashsize = estimate_hashagg_tablesize(path,
@@ -4349,6 +4362,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->hashable = false;
rollup->is_hashed = false;
new_rollups = lappend(new_rollups, rollup);
+ /* update is_sorted to true */
+ is_sorted = true;
strat = AGG_MIXED;
}
@@ -4360,18 +4375,23 @@ consider_groupingsets_paths(PlannerInfo *root,
strat,
new_rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
return;
}
/*
- * If we have sorted input but nothing we can do with it, bail.
+ * Strategy is AGG_SORTED but nothing we can do with it, bail.
*/
if (list_length(gd->rollups) == 0)
return;
/*
- * Given sorted input, we try and make two paths: one sorted and one mixed
+ * When callers ask for the AGG_SORTED strategy, the first rollup must use a
+ * non-hashed aggregate; 'is_sorted' tells whether the first rollup needs to
+ * do its own sort.
+ *
+ * We try and make two paths: one sorted and one mixed
* sort/hash. (We need to try both because hashagg might be disabled, or
* some columns might not be sortable.)
*
@@ -4428,7 +4448,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* We leave the first rollup out of consideration since it's the
- * one that matches the input sort order. We assign indexes "i"
+ * one that needs to be sorted. We assign indexes "i"
* to only those entries considered for hashing; the second loop,
* below, must use the same condition.
*/
@@ -4517,7 +4537,8 @@ consider_groupingsets_paths(PlannerInfo *root,
AGG_MIXED,
rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
}
}
@@ -4533,7 +4554,8 @@ consider_groupingsets_paths(PlannerInfo *root,
AGG_SORTED,
gd->rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
}
/*
@@ -6400,6 +6422,16 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
path->pathkeys);
if (path == cheapest_path || is_sorted)
{
+ if (parse->groupingSets)
+ {
+ /* consider AGG_SORTED strategy */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_costs, dNumGroups,
+ AGG_SORTED);
+ continue;
+ }
+
/* Sort the cheapest-total path if it isn't already sorted */
if (!is_sorted)
path = (Path *) create_sort_path(root,
@@ -6408,14 +6440,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
root->group_pathkeys,
-1.0);
- /* Now decide what to stick atop it */
- if (parse->groupingSets)
- {
- consider_groupingsets_paths(root, grouped_rel,
- path, true, can_hash,
- gd, agg_costs, dNumGroups);
- }
- else if (parse->hasAggs)
+ if (parse->hasAggs)
{
/*
* We have aggregation, possibly with plain GROUP BY. Make
@@ -6515,7 +6540,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ AGG_HASHED);
}
else
{
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122ee2..6e8899227f 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2984,6 +2984,7 @@ create_agg_path(PlannerInfo *root,
* 'rollups' is a list of RollupData nodes
* 'agg_costs' contains cost info about the aggregate functions to be computed
* 'numGroups' is the estimated total number of groups
+ * 'is_sorted' is true if the input is sorted on the group cols of the first rollup
*/
GroupingSetsPath *
create_groupingsets_path(PlannerInfo *root,
@@ -2993,7 +2994,8 @@ create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups)
+ double numGroups,
+ bool is_sorted)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
PathTarget *target = rel->reltarget;
@@ -3011,6 +3013,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->is_sorted = is_sorted;
/*
* Simplify callers by downgrading AGG_SORTED to AGG_PLAIN, and AGG_MIXED
@@ -3062,14 +3065,33 @@ create_groupingsets_path(PlannerInfo *root,
*/
if (is_first)
{
+ Cost input_startup_cost = subpath->startup_cost;
+ Cost input_total_cost = subpath->total_cost;
+
+ if (!rollup->is_hashed && !is_sorted && numGroupCols)
+ {
+ Path sort_path; /* dummy for result of cost_sort */
+
+ cost_sort(&sort_path, root, NIL,
+ input_total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ -1.0);
+
+ input_startup_cost = sort_path.startup_cost;
+ input_total_cost = sort_path.total_cost;
+ }
+
cost_agg(&pathnode->path, root,
aggstrategy,
agg_costs,
numGroupCols,
rollup->numGroups,
having_qual,
- subpath->startup_cost,
- subpath->total_cost,
+ input_startup_cost,
+ input_total_cost,
subpath->rows,
subpath->pathtarget->width);
is_first = false;
@@ -3081,7 +3103,7 @@ create_groupingsets_path(PlannerInfo *root,
Path sort_path; /* dummy for result of cost_sort */
Path agg_path; /* dummy for result of cost_agg */
- if (rollup->is_hashed || is_first_sort)
+ if (rollup->is_hashed || (is_first_sort && is_sorted))
{
/*
* Account for cost of aggregation, but don't charge input
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index a5b8a004d1..9e70bd8b84 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -277,8 +277,6 @@ typedef struct AggStatePerPhaseData
ExprState **eqfunctions; /* expression returning equality, indexed by
* nr of cols to compare */
Agg *aggnode; /* Agg node for phase data */
- Sort *sortnode; /* Sort node for input ordering for phase */
-
ExprState *evaltrans; /* evaluation of transition functions */
/* cached variants of the compiled expression */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50f09..75a45b2549 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2103,8 +2103,11 @@ typedef struct AggState
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
+ /* these fields are used in AGG_SORTED and AGG_MIXED */
+ bool input_sorted; /* is the input already sorted for phase 1? */
+
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 50
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809644..c1e69c808f 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1702,6 +1702,7 @@ typedef struct GroupingSetsPath
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
uint64 transitionSpace; /* for pass-by-ref transition data */
+ bool is_sorted; /* input sorted in groupcols of first rollup */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7b6d..3cd2537e9e 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -818,6 +818,7 @@ typedef struct Agg
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Plan *sortnode; /* agg does its own sort, only used by grouping sets now */
} Agg;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe112a..f9f388ba06 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,7 +217,8 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups);
+ double numGroups,
+ bool is_sorted);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
PathTarget *target,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 4781201001..5954ff3997 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain, double dNumGroups,
- Size transitionSpace, Plan *lefttree);
+ Size transitionSpace, Plan *sortnode, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index dbe5140b55..1acbbfad55 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -366,15 +366,14 @@ explain (costs off)
select g as alias1, g as alias2
from generate_series(1,3) g
group by alias1, rollup(alias2);
- QUERY PLAN
-------------------------------------------------
+ QUERY PLAN
+------------------------------------------
GroupAggregate
- Group Key: g, g
- Group Key: g
- -> Sort
- Sort Key: g
- -> Function Scan on generate_series g
-(6 rows)
+ Sort Key: g, g
+ Group Key: g, g
+ Group Key: g
+ -> Function Scan on generate_series g
+(5 rows)
select g as alias1, g as alias2
from generate_series(1,3) g
@@ -640,15 +639,14 @@ select a, b, sum(v.x)
-- Test reordering of grouping sets
explain (costs off)
select * from gstest1 group by grouping sets((a,b,v),(v)) order by v,b,a;
- QUERY PLAN
-------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------
GroupAggregate
- Group Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
- Group Key: "*VALUES*".column3
- -> Sort
- Sort Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
- -> Values Scan on "*VALUES*"
-(6 rows)
+ Sort Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
+ Group Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
+ Group Key: "*VALUES*".column3
+ -> Values Scan on "*VALUES*"
+(5 rows)
-- Agg level check. This query should error out.
select (select grouping(a,b) from gstest2) from gstest2 group by a,b;
@@ -723,13 +721,12 @@ explain (costs off)
QUERY PLAN
----------------------------------
GroupAggregate
- Group Key: a
- Group Key: ()
+ Sort Key: a
+ Group Key: a
+ Group Key: ()
Filter: (a IS DISTINCT FROM 1)
- -> Sort
- Sort Key: a
- -> Seq Scan on gstest2
-(7 rows)
+ -> Seq Scan on gstest2
+(6 rows)
select v.c, (select count(*) from gstest2 group by () having v.c)
from (values (false),(true)) v(c) order by v.c;
@@ -1018,18 +1015,17 @@ explain (costs off) select a, b, grouping(a,b), sum(v), count(*), max(v)
explain (costs off)
select a, b, grouping(a,b), array_agg(v order by v)
from gstest1 group by cube(a,b);
- QUERY PLAN
-----------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------
GroupAggregate
- Group Key: "*VALUES*".column1, "*VALUES*".column2
- Group Key: "*VALUES*".column1
- Group Key: ()
+ Sort Key: "*VALUES*".column1, "*VALUES*".column2
+ Group Key: "*VALUES*".column1, "*VALUES*".column2
+ Group Key: "*VALUES*".column1
+ Group Key: ()
Sort Key: "*VALUES*".column2
Group Key: "*VALUES*".column2
- -> Sort
- Sort Key: "*VALUES*".column1, "*VALUES*".column2
- -> Values Scan on "*VALUES*"
-(9 rows)
+ -> Values Scan on "*VALUES*"
+(8 rows)
-- unsortable cases
select unsortable_col, count(*)
@@ -1071,11 +1067,10 @@ explain (costs off)
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
Hash Key: unsortable_col
- Group Key: unhashable_col
- -> Sort
- Sort Key: unhashable_col
- -> Seq Scan on gstest4
-(8 rows)
+ Sort Key: unhashable_col
+ Group Key: unhashable_col
+ -> Seq Scan on gstest4
+(7 rows)
select unhashable_col, unsortable_col,
grouping(unhashable_col, unsortable_col),
@@ -1114,11 +1109,10 @@ explain (costs off)
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
Hash Key: v, unsortable_col
- Group Key: v, unhashable_col
- -> Sort
- Sort Key: v, unhashable_col
- -> Seq Scan on gstest4
-(8 rows)
+ Sort Key: v, unhashable_col
+ Group Key: v, unhashable_col
+ -> Seq Scan on gstest4
+(7 rows)
-- empty input: first is 0 rows, second 1, third 3 etc.
select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),a);
@@ -1366,19 +1360,18 @@ explain (costs off)
BEGIN;
SET LOCAL enable_hashagg = false;
EXPLAIN (COSTS OFF) SELECT a, b, count(*), max(a), max(b) FROM gstest3 GROUP BY GROUPING SETS(a, b,()) ORDER BY a, b;
- QUERY PLAN
----------------------------------------
+ QUERY PLAN
+---------------------------------
Sort
Sort Key: a, b
-> GroupAggregate
- Group Key: a
- Group Key: ()
+ Sort Key: a
+ Group Key: a
+ Group Key: ()
Sort Key: b
Group Key: b
- -> Sort
- Sort Key: a
- -> Seq Scan on gstest3
-(10 rows)
+ -> Seq Scan on gstest3
+(9 rows)
SELECT a, b, count(*), max(a), max(b) FROM gstest3 GROUP BY GROUPING SETS(a, b,()) ORDER BY a, b;
a | b | count | max | max
@@ -1549,22 +1542,21 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+----------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
- Group Key: unique1
+ Sort Key: unique1
+ Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
Sort Key: thousand
Group Key: thousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(13 rows)
+ -> Seq Scan on tenk1
+(12 rows)
explain (costs off)
select unique1,
@@ -1572,18 +1564,17 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+-------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
- Group Key: unique1
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(9 rows)
+ Sort Key: unique1
+ Group Key: unique1
+ -> Seq Scan on tenk1
+(8 rows)
set work_mem = '384kB';
explain (costs off)
@@ -1592,21 +1583,20 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+----------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
Hash Key: thousand
- Group Key: unique1
+ Sort Key: unique1
+ Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(12 rows)
+ -> Seq Scan on tenk1
+(11 rows)
-- check collation-sensitive matching between grouping expressions
-- (similar to a check for aggregates, but there are additional code
@@ -1648,23 +1638,22 @@ select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
(select g%1000 as g1000, g%100 as g100, g%10 as g10, g
from generate_series(0,199999) g) s
group by cube (g1000,g100,g10);
- QUERY PLAN
----------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------
GroupAggregate
- Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
- Group Key: ((g.g % 1000)), ((g.g % 100))
- Group Key: ((g.g % 1000))
- Group Key: ()
- Sort Key: ((g.g % 100)), ((g.g % 10))
- Group Key: ((g.g % 100)), ((g.g % 10))
- Group Key: ((g.g % 100))
- Sort Key: ((g.g % 10)), ((g.g % 1000))
- Group Key: ((g.g % 10)), ((g.g % 1000))
- Group Key: ((g.g % 10))
- -> Sort
- Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
- -> Function Scan on generate_series g
-(14 rows)
+ Sort Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Group Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Group Key: (g.g % 1000), (g.g % 100)
+ Group Key: (g.g % 1000)
+ Group Key: ()
+ Sort Key: (g.g % 100), (g.g % 10)
+ Group Key: (g.g % 100), (g.g % 10)
+ Group Key: (g.g % 100)
+ Sort Key: (g.g % 10), (g.g % 1000)
+ Group Key: (g.g % 10), (g.g % 1000)
+ Group Key: (g.g % 10)
+ -> Function Scan on generate_series g
+(13 rows)
create table gs_group_1 as
select g1000, g100, g10, sum(g::numeric), count(*), max(g::text) from
--
2.14.1
Attachment: 0002-fixes.patch (application/octet-stream)
From 4a84528d96b61b5a4828b8735f5418a79bcf526e Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 01:20:58 +0100
Subject: [PATCH 2/5] fixes
---
src/backend/executor/nodeAgg.c | 3 +--
src/backend/optimizer/plan/createplan.c | 15 ++++++++---
src/backend/optimizer/plan/planner.c | 47 ++++++++++++++++++++++++++-------
3 files changed, 49 insertions(+), 16 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index bf484e19ec..ebd267db68 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -516,7 +516,7 @@ initialize_phase(AggState *aggstate, int newphase)
*/
if (newphase > 0 && newphase < aggstate->numphases - 1)
{
- Sort *sortnode = (Sort *)aggstate->phases[newphase + 1].aggnode->sortnode;
+ Sort *sortnode = (Sort *) aggstate->phases[newphase + 1].aggnode->sortnode;
PlanState *outerNode = outerPlanState(aggstate);
TupleDesc tupDesc = ExecGetResultType(outerNode);
@@ -4672,7 +4672,6 @@ ExecReScanAgg(AggState *node)
node->projected_set = -1;
}
-
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index d5b34089aa..7c29f89cc3 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2171,10 +2171,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
/*
* Agg can project, so no need to be terribly picky about child tlist, but
- * we do need grouping columns to be available; If the groupingsets need
+ * we do need grouping columns to be available. If the groupingsets need
* to sort the input, the agg will store the input rows in a tuplesort,
- * it therefore behooves us to request a small tlist to avoid wasting
- * spaces.
+ * so we request a small tlist to avoid wasting space.
*/
if (!best_path->is_sorted)
flags = flags | CP_SMALL_TLIST;
@@ -2239,6 +2238,11 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
+ /*
+ * If it's the first rollup using sorted mode, add an explicit sort
+ * node only if the input is not sorted yet; for the other rollups
+ * using sorted mode, always add an explicit sort.
+ */
if (!rollup->is_hashed)
{
if (!is_first_sort ||
@@ -2297,7 +2301,10 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
- /* the input is not sorted yet */
+ /*
+ * When the rollup uses sorted mode, and the input is not already sorted,
+ * add an explicit sort.
+ */
if (!rollup->is_hashed &&
!best_path->is_sorted)
{
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b7858e8d02..6578b3fef0 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4188,13 +4188,22 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
* times, so it's important that it not scribble on input. No result is
* returned, but any generated paths are added to grouped_rel.
*
- * - strat:
- * preferred aggregate strategy to use.
- *
- * - is_sorted:
- * Is the input sorted on the groupCols of the first rollup. Caller
- * must set it correctly if strat is set to AGG_SORTED, the planner
- * uses it to generate a sortnode.
+ * The caller specifies the preferred aggregate strategy (sorted or hashed)
+ * using the strat parameter. When the requested strategy is AGG_SORTED, the
+ * input path needs to be sorted accordingly (is_sorted needs to be true).
+ *
+ * Pengzhou: is_sorted is actually a hint here; callers that prefer
+ * AGG_SORTED are no longer forced to add an explicit sort path before
+ * calling this function. Please see the comments in the callers.
+ *
+ * Ideally, consider_groupingsets_paths() should check whether the input is
+ * sorted or not. However, callers that prefer AGG_SORTED already have to
+ * check is_sorted (to see whether a non-cheapest path is worth considering),
+ * so consider_groupingsets_paths() doesn't need to check it again. For
+ * callers that prefer AGG_HASHED, is_sorted is never checked; they only
+ * consider the cheapest path, but the cheapest path can coincidentally be
+ * sorted already, which is why AGG_MIXED may be chosen even when strat is
+ * specified as AGG_HASHED.
*/
static void
consider_groupingsets_paths(PlannerInfo *root,
@@ -4262,7 +4271,7 @@ consider_groupingsets_paths(PlannerInfo *root,
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
l_start = lnext(gd->rollups, l_start);
- /* update is_sorted to true */
+ /* the input happens to be usefully sorted; update is_sorted */
is_sorted = true;
}
@@ -4362,7 +4371,10 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->hashable = false;
rollup->is_hashed = false;
new_rollups = lappend(new_rollups, rollup);
- /* update is_sorted to true */
+ /*
+ * The first non-hashed rollup is a PLAIN agg, so is_sorted
+ * should be true.
+ */
is_sorted = true;
strat = AGG_MIXED;
}
@@ -4397,6 +4409,9 @@ consider_groupingsets_paths(PlannerInfo *root,
*
* can_hash is passed in as false if some obstacle elsewhere (such as
* ordered aggs) means that we shouldn't consider hashing at all.
+ *
+ * XXX This comment seems to be broken by the patch, and it's not very
+ * clear to me what it tries to say.
*/
if (can_hash && gd->any_hashable)
{
@@ -4448,7 +4463,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* We leave the first rollup out of consideration since it's the
- * one that need to be sorted. We assign indexes "i"
+ * one that matches the input sort order. We assign indexes "i"
* to only those entries considered for hashing; the second loop,
* below, must use the same condition.
*/
@@ -6422,6 +6437,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
path->pathkeys);
if (path == cheapest_path || is_sorted)
{
+ /* XXX Why do we do it before possibly adding an explicit sort on top? */
+ /*
+ * Pengzhou: this patch intends to let each sorted aggregate phase
+ * do its own sorting, including the first phase, so in the final
+ * stage of parallel grouping sets the tuples are put into the temp
+ * storage of each sorted phase, and then each sorted phase does
+ * its own sorting, one by one.
+ * Adding an explicit sort path underneath the main Agg node would
+ * sort the tuples from all grouping sets using the sort key of the
+ * first phase, which is not right.
+ */
if (parse->groupingSets)
{
/* consider AGG_SORTED strategy */
--
2.14.1
Attachment: 0003-fix-a-numtrans-bug.patch (application/octet-stream)
From fe017ede9a6e5efdb3c91e4047c3bf5bf072dee9 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Thu, 12 Mar 2020 04:38:36 -0400
Subject: [PATCH 3/5] fix a numtrans bug
aggstate->numtrans is always zero when building the hash table for
hash aggregates, which makes the additional size of the hash table
incorrect.
---
src/backend/executor/nodeAgg.c | 67 +++++++++++++++++++++++-------------------
1 file changed, 36 insertions(+), 31 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index ebd267db68..908c2980b8 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -3574,39 +3574,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
- Plan *outerplan = outerPlan(node);
- uint64 totalGroups = 0;
- int i;
-
- aggstate->hash_metacxt = AllocSetContextCreate(
- aggstate->ss.ps.state->es_query_cxt,
- "HashAgg meta context",
- ALLOCSET_DEFAULT_SIZES);
- aggstate->hash_spill_slot = ExecInitExtraTupleSlot(
- estate, scanDesc, &TTSOpsMinimalTuple);
-
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
-
- aggstate->hashentrysize = hash_agg_entry_size(
- aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
-
- /*
- * Consider all of the grouping sets together when setting the limits
- * and estimating the number of partitions. This can be inaccurate
- * when there is more than one grouping set, but should still be
- * reasonable.
- */
- for (i = 0; i < aggstate->num_hashes; i++)
- totalGroups += aggstate->perhash[i].aggnode->numGroups;
-
- hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
- &aggstate->hash_mem_limit,
- &aggstate->hash_ngroups_limit,
- &aggstate->hash_planned_partitions);
- find_hash_columns(aggstate);
- build_hash_tables(aggstate);
- aggstate->table_filled = false;
}
/*
@@ -3962,6 +3931,42 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
+ /* Initialize hash contexts and hash tables for hash aggregates */
+ if (use_hashing)
+ {
+ Plan *outerplan = outerPlan(node);
+ uint64 totalGroups = 0;
+ int i;
+
+ aggstate->hash_metacxt = AllocSetContextCreate(
+ aggstate->ss.ps.state->es_query_cxt,
+ "HashAgg meta context",
+ ALLOCSET_DEFAULT_SIZES);
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(
+ estate, scanDesc, &TTSOpsMinimalTuple);
+
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ /*
+ * Consider all of the grouping sets together when setting the limits
+ * and estimating the number of partitions. This can be inaccurate
+ * when there is more than one grouping set, but should still be
+ * reasonable.
+ */
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit,
+ &aggstate->hash_planned_partitions);
+
+ find_hash_columns(aggstate);
+ build_hash_tables(aggstate);
+ aggstate->table_filled = false;
+ }
+
/*
* Build expressions doing all the transition work at once. We build a
* different one for each phase, as the number of transition function
--
2.14.1
Attachment: 0004-Reorganise-the-aggregate-phases.patch (application/octet-stream)
From 6cb85d8121caf8698948d4004e9639e1a1f232db Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:13:44 -0400
Subject: [PATCH 4/5] Reorganise the aggregate phases
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This commit is a preparatory step for supporting parallel grouping sets.
When planning, PG used to organize the grouping sets in [HASHED] -> [SORTED]
order, meaning HASHED aggregates were always located before SORTED ones. When
initializing the AGG node, PG also organized the aggregate phases in
[HASHED] -> [SORTED] order, with all HASHED grouping sets squeezed into phase 0.
When executing the AGG node with the AGG_SORTED or AGG_MIXED strategy, the
executor started from phase 1 -> phase 2 -> phase 3, and then ran phase 0 last
in the AGG_MIXED case. This complicates parallel grouping sets in two ways:
first, we need complicated logic to locate the first sorted rollup/phase and to
handle the special order for each strategy in many places; second, squeezing
all hashed grouping sets into phase 0 does not work for parallel grouping sets,
because we cannot put all hash transition functions into one expression state
in the final stage.
This commit organizes the grouping sets in a more natural order, [SORTED] ->
[HASHED], and the HASHED sets are no longer squeezed into a single phase; we
use another way to put all hash transitions into the first phase's expression
state, and the executor now starts execution from phase 0 for all strategies.
This commit also moves 'sort_in' from AggState to the AggStatePerPhase*
structures. This helps to handle more complicated cases once parallel grouping
sets are introduced; we might then need to add a tuplestore 'store_in' to
store partial aggregate results for PLAIN sets.
This commit also makes the hash spill refill logic clearer and avoids using
a nullcheck when refilling the hash table.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 4 +-
src/backend/commands/explain.c | 2 +-
src/backend/executor/execExpr.c | 57 +-
src/backend/executor/execExprInterp.c | 30 +-
src/backend/executor/nodeAgg.c | 946 +++++++++++-----------
src/backend/jit/llvm/llvmjit_expr.c | 51 +-
src/backend/optimizer/plan/createplan.c | 29 +-
src/backend/optimizer/plan/planner.c | 9 +-
src/backend/optimizer/util/pathnode.c | 65 +-
src/include/executor/execExpr.h | 5 +-
src/include/executor/executor.h | 2 +-
src/include/executor/nodeAgg.h | 34 +-
src/include/nodes/execnodes.h | 25 +-
src/test/regress/expected/groupingsets.out | 40 +-
src/test/regress/expected/partition_aggregate.out | 2 +-
15 files changed, 655 insertions(+), 646 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 62c2697920..fc0ed2f4d5 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3448,8 +3448,8 @@ select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls la
Sort Key: ft1.c2
-> MixedAggregate
Output: c2, sum(c1)
- Hash Key: ft1.c2
Group Key: ()
+ Hash Key: ft1.c2
-> Foreign Scan on public.ft1
Output: c2, c1
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
@@ -3473,8 +3473,8 @@ select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last
Sort Key: ft1.c2
-> MixedAggregate
Output: c2, sum(c1)
- Hash Key: ft1.c2
Group Key: ()
+ Hash Key: ft1.c2
-> Foreign Scan on public.ft1
Output: c2, c1
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8c82d6ea95..4dec889f77 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2319,7 +2319,7 @@ show_grouping_set_keys(PlanState *planstate,
const char *keyname;
const char *keysetname;
- if (aggnode->aggstrategy == AGG_HASHED || aggnode->aggstrategy == AGG_MIXED)
+ if (aggnode->aggstrategy == AGG_HASHED)
{
keyname = "Hash Key";
keysetname = "Hash Keys";
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 1370ffec50..3533f5ccc8 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -80,7 +80,7 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash,
+ int transno, int setno, AggStatePerPhase perphase,
bool nullcheck);
@@ -2931,13 +2931,13 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
* the array of AggStatePerGroup, and skip evaluation if so.
*/
ExprState *
-ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash, bool nullcheck)
+ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase, bool nullcheck, bool allow_concurrent_hashing)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
ExprEvalStep scratch = {0};
bool isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
+ ListCell *lc;
LastAttnumInfo deform = {0, 0, 0};
state->expr = (Expr *) aggstate;
@@ -2978,6 +2978,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
NullableDatum *strictargs = NULL;
bool *strictnulls = NULL;
int argno;
+ int setno;
ListCell *bail;
/*
@@ -3155,37 +3156,27 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* grouping set). Do so for both sort and hash based computations, as
* applicable.
*/
- if (doSort)
+ for (setno = 0; setno < phase->numsets; setno++)
{
- int processGroupingSets = Max(phase->numsets, 1);
- int setoff = 0;
-
- for (int setno = 0; setno < processGroupingSets; setno++)
- {
- ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false,
- nullcheck);
- setoff++;
- }
+ ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
+ pertrans, transno, setno, phase, nullcheck);
}
- if (doHash)
+ /*
+ * Call transition function for HASHED aggs that can be
+ * advanced concurrently.
+ */
+ if (allow_concurrent_hashing &&
+ phase->concurrent_hashes)
{
- int numHashes = aggstate->num_hashes;
- int setoff;
-
- /* in MIXED mode, there'll be preceding transition values */
- if (aggstate->aggstrategy != AGG_HASHED)
- setoff = aggstate->maxsets;
- else
- setoff = 0;
-
- for (int setno = 0; setno < numHashes; setno++)
+ foreach(lc, phase->concurrent_hashes)
{
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) lfirst(lc);
+
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true,
+ pertrans, transno, 0,
+ (AggStatePerPhase) perhash,
nullcheck);
- setoff++;
}
}
@@ -3234,14 +3225,17 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash,
+ int transno, int setno, AggStatePerPhase perphase,
bool nullcheck)
{
ExprContext *aggcontext;
int adjust_jumpnull = -1;
- if (ishash)
+ if (perphase->is_hashed)
+ {
+ Assert(setno == 0);
aggcontext = aggstate->hashcontext;
+ }
else
aggcontext = aggstate->aggcontexts[setno];
@@ -3249,9 +3243,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (nullcheck)
{
scratch->opcode = EEOP_AGG_PLAIN_PERGROUP_NULLCHECK;
- scratch->d.agg_plain_pergroup_nullcheck.setoff = setoff;
+ scratch->d.agg_plain_pergroup_nullcheck.pergroups = perphase->pergroups;
/* adjust later */
scratch->d.agg_plain_pergroup_nullcheck.jumpnull = -1;
+ scratch->d.agg_plain_pergroup_nullcheck.setno = setno;
ExprEvalPushStep(state, scratch);
adjust_jumpnull = state->steps_len - 1;
}
@@ -3319,7 +3314,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
scratch->d.agg_trans.pertrans = pertrans;
scratch->d.agg_trans.setno = setno;
- scratch->d.agg_trans.setoff = setoff;
+ scratch->d.agg_trans.pergroups = perphase->pergroups;
scratch->d.agg_trans.transno = transno;
scratch->d.agg_trans.aggcontext = aggcontext;
ExprEvalPushStep(state, scratch);
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 113ed1547c..b0dbba4e55 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -1610,9 +1610,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_CASE(EEOP_AGG_PLAIN_PERGROUP_NULLCHECK)
{
- AggState *aggstate = castNode(AggState, state->parent);
- AggStatePerGroup pergroup_allaggs = aggstate->all_pergroups
- [op->d.agg_plain_pergroup_nullcheck.setoff];
+ AggStatePerGroup pergroup_allaggs =
+ op->d.agg_plain_pergroup_nullcheck.pergroups
+ [op->d.agg_plain_pergroup_nullcheck.setno];
if (pergroup_allaggs == NULL)
EEO_JUMP(op->d.agg_plain_pergroup_nullcheck.jumpnull);
@@ -1636,8 +1636,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1665,8 +1665,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1684,8 +1684,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1702,8 +1702,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
@@ -1724,8 +1724,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
@@ -1742,8 +1742,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 908c2980b8..8a8b49547b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -250,6 +250,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
@@ -348,7 +349,7 @@ typedef struct HashAggSpill
*/
typedef struct HashAggBatch
{
- int setno; /* grouping set */
+ int phaseidx; /* phase that owns this batch */
int used_bits; /* number of bits of hash already used */
LogicalTapeSet *tapeset; /* borrowed reference to tape set */
int input_tapenum; /* input partition tape */
@@ -379,7 +380,7 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
-static void prepare_hash_slot(AggState *aggstate);
+static void prepare_hash_slot(AggState *aggstate, AggStatePerPhaseHash perhash);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
@@ -390,9 +391,9 @@ static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
-static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
- bool nullcheck);
+static void build_hash_table(AggState *aggstate,
+ AggStatePerPhaseHash perhash, long nbuckets);
+static void hashagg_recompile_expressions(AggState *aggstate);
static long hash_choose_num_buckets(double hashentrysize,
long estimated_nbuckets,
Size memory);
@@ -400,12 +401,16 @@ static int hash_choose_num_partitions(uint64 input_groups,
double hashentrysize,
int used_bits,
int *log2_npartittions);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash);
-static void lookup_hash_entries(AggState *aggstate);
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate,
+ AggStatePerPhaseHash perhash,
+ uint32 hash);
+static void lookup_hash_entries(AggState *aggstate,
+ AggStatePerPhaseHash perhash,
+ List *perhashes);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
-static void agg_sort_input(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static bool agg_refill_hash_table(AggState *aggstate);
+static void agg_sort_input(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
static void hash_agg_check_limits(AggState *aggstate);
@@ -415,7 +420,7 @@ static void hash_agg_update_metrics(AggState *aggstate, bool from_tape,
static void hashagg_finish_initial_spills(AggState *aggstate);
static void hashagg_reset_spill_state(AggState *aggstate);
static HashAggBatch *hashagg_batch_new(LogicalTapeSet *tapeset,
- int input_tapenum, int setno,
+ int input_tapenum, int phaseidx,
int64 input_tuples, int used_bits);
static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
@@ -424,7 +429,7 @@ static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
uint32 hash);
static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
- int setno);
+ int phaseidx);
static void hashagg_tapeinfo_init(AggState *aggstate);
static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
int ndest);
@@ -458,7 +463,10 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
* ExecAggPlainTransByRef().
*/
if (is_hash)
+ {
+ Assert(setno == 0);
aggstate->curaggcontext = aggstate->hashcontext;
+ }
else
aggstate->curaggcontext = aggstate->aggcontexts[setno];
@@ -466,72 +474,73 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
}
/*
- * Switch to phase "newphase", which must either be 0 or 1 (to reset) or
+ * Switch to phase "newphase", which must either be 0 (to reset) or
* current_phase + 1. Juggle the tuplesorts accordingly.
- *
- * Phase 0 is for hashing, which we currently handle last in the AGG_MIXED
- * case, so when entering phase 0, all we need to do is drop open sorts.
*/
static void
initialize_phase(AggState *aggstate, int newphase)
{
- Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);
+ AggStatePerPhase current_phase;
+ AggStatePerPhaseSort persort;
+
+ /* Don't use aggstate->phase here, it might not be initialized yet */
+ current_phase = aggstate->phases[aggstate->current_phase];
/*
* Whatever the previous state, we're now done with whatever input
- * tuplesort was in use.
+ * tuplesort was in use, so clean it up.
+ *
+ * Note: we keep the first tuplesort/tuplestore; this lets rescan
+ * avoid re-sorting the input in some cases.
*/
- if (aggstate->sort_in)
+ if (!current_phase->is_hashed && aggstate->current_phase > 0)
{
- tuplesort_end(aggstate->sort_in);
- aggstate->sort_in = NULL;
- }
-
- if (newphase <= 1)
- {
- /*
- * Discard any existing output tuplesort.
- */
- if (aggstate->sort_out)
+ persort = (AggStatePerPhaseSort) current_phase;
+ if (persort->sort_in)
{
- tuplesort_end(aggstate->sort_out);
- aggstate->sort_out = NULL;
+ tuplesort_end(persort->sort_in);
+ persort->sort_in = NULL;
}
}
- else
- {
- /*
- * The old output tuplesort becomes the new input one, and this is the
- * right time to actually sort it.
- */
- aggstate->sort_in = aggstate->sort_out;
- aggstate->sort_out = NULL;
- Assert(aggstate->sort_in);
- tuplesort_performsort(aggstate->sort_in);
- }
+
+ /* advance to next phase */
+ aggstate->current_phase = newphase;
+ aggstate->phase = aggstate->phases[newphase];
+
+ if (aggstate->phase->is_hashed)
+ return;
+
+ /* New phase is not hashed */
+ persort = (AggStatePerPhaseSort) aggstate->phase;
+
+ /* This is the right time to actually sort it. */
+ if (persort->sort_in)
+ tuplesort_performsort(persort->sort_in);
/*
- * If this isn't the last phase, we need to sort appropriately for the
+ * If copy_out is set, we need to sort appropriately for the
* next phase in sequence.
*/
- if (newphase > 0 && newphase < aggstate->numphases - 1)
+ if (persort->copy_out)
{
- Sort *sortnode = (Sort *) aggstate->phases[newphase + 1].aggnode->sortnode;
- PlanState *outerNode = outerPlanState(aggstate);
- TupleDesc tupDesc = ExecGetResultType(outerNode);
-
- aggstate->sort_out = tuplesort_begin_heap(tupDesc,
- sortnode->numCols,
- sortnode->sortColIdx,
- sortnode->sortOperators,
- sortnode->collations,
- sortnode->nullsFirst,
- work_mem,
- NULL, false);
+ AggStatePerPhaseSort next =
+ (AggStatePerPhaseSort) aggstate->phases[newphase + 1];
+ Sort *sortnode = (Sort *) next->phasedata.aggnode->sortnode;
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+
+ Assert(!next->phasedata.is_hashed);
+
+ if (!next->sort_in)
+ next->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
}
-
- aggstate->current_phase = newphase;
- aggstate->phase = &aggstate->phases[newphase];
}
/*
@@ -546,12 +555,16 @@ static TupleTableSlot *
fetch_input_tuple(AggState *aggstate)
{
TupleTableSlot *slot;
+ AggStatePerPhaseSort current_phase;
+
+ Assert(!aggstate->phase->is_hashed);
+ current_phase = (AggStatePerPhaseSort) aggstate->phase;
- if (aggstate->sort_in)
+ if (current_phase->sort_in)
{
/* make sure we check for interrupts in either path through here */
CHECK_FOR_INTERRUPTS();
- if (!tuplesort_gettupleslot(aggstate->sort_in, true, false,
+ if (!tuplesort_gettupleslot(current_phase->sort_in, true, false,
aggstate->sort_slot, NULL))
return NULL;
slot = aggstate->sort_slot;
@@ -559,8 +572,13 @@ fetch_input_tuple(AggState *aggstate)
else
slot = ExecProcNode(outerPlanState(aggstate));
- if (!TupIsNull(slot) && aggstate->sort_out)
- tuplesort_puttupleslot(aggstate->sort_out, slot);
+ if (!TupIsNull(slot) && current_phase->copy_out)
+ {
+ AggStatePerPhaseSort next =
+ (AggStatePerPhaseSort) aggstate->phases[aggstate->current_phase + 1];
+ Assert(!next->phasedata.is_hashed);
+ tuplesort_puttupleslot(next->sort_in, slot);
+ }
return slot;
}
@@ -666,7 +684,7 @@ initialize_aggregates(AggState *aggstate,
int numReset)
{
int transno;
- int numGroupingSets = Max(aggstate->phase->numsets, 1);
+ int numGroupingSets = aggstate->phase->numsets;
int setno = 0;
int numTrans = aggstate->numtrans;
AggStatePerTrans transstates = aggstate->pertrans;
@@ -1194,10 +1212,9 @@ finalize_partialaggregate(AggState *aggstate,
* hashslot. This is necessary to compute the hash or perform a lookup.
*/
static void
-prepare_hash_slot(AggState *aggstate)
+prepare_hash_slot(AggState *aggstate, AggStatePerPhaseHash perhash)
{
TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -1431,29 +1448,33 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
static void
build_hash_tables(AggState *aggstate)
{
- int setno;
+ int phaseidx;
- for (setno = 0; setno < aggstate->num_hashes; ++setno)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
long nbuckets;
Size memory;
+ if (!phase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) phase;
+
if (perhash->hashtable != NULL)
{
ResetTupleHashTable(perhash->hashtable);
continue;
}
- Assert(perhash->aggnode->numGroups > 0);
-
memory = aggstate->hash_mem_limit / aggstate->num_hashes;
/* choose reasonable number of buckets per hashtable */
nbuckets = hash_choose_num_buckets(
- aggstate->hashentrysize, perhash->aggnode->numGroups, memory);
+ aggstate->hashentrysize, phase->aggnode->numGroups, memory);
- build_hash_table(aggstate, setno, nbuckets);
+ build_hash_table(aggstate, perhash, nbuckets);
}
aggstate->hash_ngroups_current = 0;
@@ -1463,9 +1484,8 @@ build_hash_tables(AggState *aggstate)
* Build a single hashtable for this grouping set.
*/
static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, AggStatePerPhaseHash perhash, long nbuckets)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
MemoryContext metacxt = aggstate->hash_metacxt;
MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory;
MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
@@ -1489,7 +1509,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
perhash->hashGrpColIdxHash,
perhash->eqfuncoids,
perhash->hashfunctions,
- perhash->aggnode->grpCollations,
+ perhash->phasedata.aggnode->grpCollations,
nbuckets,
additionalsize,
metacxt,
@@ -1528,23 +1548,29 @@ find_hash_columns(AggState *aggstate)
{
Bitmapset *base_colnos;
List *outerTlist = outerPlanState(aggstate)->plan->targetlist;
- int numHashes = aggstate->num_hashes;
EState *estate = aggstate->ss.ps.state;
int j;
/* Find Vars that will be needed in tlist and qual */
base_colnos = find_unaggregated_cols(aggstate);
- for (j = 0; j < numHashes; ++j)
+ for (j = 0; j < aggstate->numphases; ++j)
{
- AggStatePerHash perhash = &aggstate->perhash[j];
+ AggStatePerPhase perphase = aggstate->phases[j];
+ AggStatePerPhaseHash perhash;
Bitmapset *colnos = bms_copy(base_colnos);
- AttrNumber *grpColIdx = perhash->aggnode->grpColIdx;
+ Bitmapset *grouped_cols = perphase->grouped_cols[0];
+ AttrNumber *grpColIdx = perphase->aggnode->grpColIdx;
List *hashTlist = NIL;
+ ListCell *lc;
TupleDesc hashDesc;
int maxCols;
int i;
+ if (!perphase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) perphase;
perhash->largestGrpColIdx = 0;
/*
@@ -1554,18 +1580,12 @@ find_hash_columns(AggState *aggstate)
* there'd be no point storing them. Use prepare_projection_slot's
* logic to determine which.
*/
- if (aggstate->phases[0].grouped_cols)
+ foreach(lc, aggstate->all_grouped_cols)
{
- Bitmapset *grouped_cols = aggstate->phases[0].grouped_cols[j];
- ListCell *lc;
+ int attnum = lfirst_int(lc);
- foreach(lc, aggstate->all_grouped_cols)
- {
- int attnum = lfirst_int(lc);
-
- if (!bms_is_member(attnum, grouped_cols))
- colnos = bms_del_member(colnos, attnum);
- }
+ if (!bms_is_member(attnum, grouped_cols))
+ colnos = bms_del_member(colnos, attnum);
}
/*
@@ -1621,7 +1641,7 @@ find_hash_columns(AggState *aggstate)
hashDesc = ExecTypeFromTL(hashTlist);
execTuplesHashPrepare(perhash->numCols,
- perhash->aggnode->grpOperators,
+ perphase->aggnode->grpOperators,
&perhash->eqfuncoids,
&perhash->hashfunctions);
perhash->hashslot =
@@ -1668,28 +1688,46 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
* expressions in the AggStatePerPhase, and reuse when appropriate.
*/
static void
-hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
+hashagg_recompile_expressions(AggState *aggstate)
{
- AggStatePerPhase phase;
- int i = minslot ? 1 : 0;
- int j = nullcheck ? 1 : 0;
+ AggStatePerPhase phase = aggstate->phase;
Assert(aggstate->aggstrategy == AGG_HASHED ||
aggstate->aggstrategy == AGG_MIXED);
- if (aggstate->aggstrategy == AGG_HASHED)
- phase = &aggstate->phases[0];
- else /* AGG_MIXED */
- phase = &aggstate->phases[1];
-
- if (phase->evaltrans_cache[i][j] == NULL)
+ if (phase->evaltrans_cache[aggstate->evaltrans_mode] == NULL)
{
const TupleTableSlotOps *outerops = aggstate->ss.ps.outerops;
- bool outerfixed = aggstate->ss.ps.outeropsfixed;
- bool dohash = true;
- bool dosort;
+ bool outerfixed = aggstate->ss.ps.outeropsfixed;
+ bool minslot = false;
+ bool nullcheck = false;
+ bool allow_concurrent_hashing = true;
- dosort = aggstate->aggstrategy == AGG_MIXED ? true : false;
+ /*
+ * We are refilling the hash table, so disallow concurrent hashing
+ * within the transition expression: the hash tables are refilled one
+ * set at a time, which avoids an unnecessary nullcheck. Meanwhile,
+ * tuples are read back from a spill file, so each is a MinimalTuple.
+ */
+ if (aggstate->evaltrans_mode == HASHREFILLMODE)
+ {
+ minslot = true;
+ nullcheck = false;
+ allow_concurrent_hashing = false;
+ }
+ /*
+ * We have entered spill mode. Concurrent hashing still works in
+ * this mode, but some grouping sets must write their tuples to
+ * spill files and their pergroup states will be NULL, so we need
+ * to add a nullcheck. HASHSPILLMODE is only set in the first phase,
+ * so we use the outer slot and minslot should be false.
+ */
+ else if (aggstate->evaltrans_mode == HASHSPILLMODE)
+ {
+ minslot = false;
+ nullcheck = true;
+ allow_concurrent_hashing = true;
+ }
/* temporarily change the outerops while compiling the expression */
if (minslot)
@@ -1698,15 +1736,15 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
aggstate->ss.ps.outeropsfixed = true;
}
- phase->evaltrans_cache[i][j] = ExecBuildAggTrans(
- aggstate, phase, dosort, dohash, nullcheck);
+ phase->evaltrans_cache[aggstate->evaltrans_mode] =
+ ExecBuildAggTrans(aggstate, phase, nullcheck, allow_concurrent_hashing);
/* change back */
aggstate->ss.ps.outerops = outerops;
aggstate->ss.ps.outeropsfixed = outerfixed;
}
- phase->evaltrans = phase->evaltrans_cache[i][j];
+ phase->evaltrans = phase->evaltrans_cache[aggstate->evaltrans_mode];
}
/*
@@ -1803,29 +1841,22 @@ static void
hash_agg_enter_spill_mode(AggState *aggstate)
{
aggstate->hash_spill_mode = true;
- hashagg_recompile_expressions(aggstate, aggstate->table_filled, true);
+
+ /* if table_filled is true, we must be refilling the hash table */
+ if (aggstate->table_filled)
+ aggstate->evaltrans_mode = HASHREFILLMODE;
+ else
+ aggstate->evaltrans_mode = HASHSPILLMODE;
+
+ hashagg_recompile_expressions(aggstate);
if (!aggstate->hash_ever_spilled)
{
Assert(aggstate->hash_tapeinfo == NULL);
- Assert(aggstate->hash_spills == NULL);
aggstate->hash_ever_spilled = true;
hashagg_tapeinfo_init(aggstate);
-
- aggstate->hash_spills = palloc(
- sizeof(HashAggSpill) * aggstate->num_hashes);
-
- for (int setno = 0; setno < aggstate->num_hashes; setno++)
- {
- AggStatePerHash perhash = &aggstate->perhash[setno];
- HashAggSpill *spill = &aggstate->hash_spills[setno];
-
- hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
- perhash->aggnode->numGroups,
- aggstate->hashentrysize);
- }
}
}
@@ -1977,9 +2008,8 @@ hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
* the current grouping set, return NULL and the caller will spill it to disk.
*/
static AggStatePerGroup
-lookup_hash_entry(AggState *aggstate, uint32 hash)
+lookup_hash_entry(AggState *aggstate, AggStatePerPhaseHash perhash, uint32 hash)
{
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
bool isnew = false;
@@ -2043,33 +2073,41 @@ lookup_hash_entry(AggState *aggstate, uint32 hash)
* efficient.
*/
static void
-lookup_hash_entries(AggState *aggstate)
+lookup_hash_entries(AggState *aggstate, AggStatePerPhaseHash perhash,
+ List *concurrent_hashes)
{
- AggStatePerGroup *pergroup = aggstate->hash_pergroup;
- int setno;
+ ListCell *lc;
+ List *all_hashes = perhash ? list_make1(perhash) : NIL;
+
+ all_hashes = list_concat(all_hashes, concurrent_hashes);
- for (setno = 0; setno < aggstate->num_hashes; setno++)
+ foreach (lc, all_hashes)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
uint32 hash;
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) lfirst(lc);
- select_current_set(aggstate, setno, true);
- prepare_hash_slot(aggstate);
+ select_current_set(aggstate, 0, true);
+ prepare_hash_slot(aggstate, perhash);
hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
- pergroup[setno] = lookup_hash_entry(aggstate, hash);
+ perhash->phasedata.pergroups[0] = lookup_hash_entry(aggstate, perhash, hash);
/* check to see if we need to spill the tuple for this grouping set */
- if (pergroup[setno] == NULL)
+ if (perhash->phasedata.pergroups[0] == NULL)
{
- HashAggSpill *spill = &aggstate->hash_spills[setno];
TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
- if (spill->partitions == NULL)
- hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
- perhash->aggnode->numGroups,
+ if (perhash->hash_spill == NULL)
+ perhash->hash_spill = palloc0(sizeof(HashAggSpill));
+
+ if (perhash->hash_spill->partitions == NULL)
+ hashagg_spill_init(perhash->hash_spill,
+ aggstate->hash_tapeinfo, 0,
+ perhash->phasedata.aggnode->numGroups,
aggstate->hashentrysize);
- hashagg_spill_tuple(spill, slot, hash);
+ hashagg_spill_tuple(perhash->hash_spill,
+ slot,
+ hash);
}
}
}
@@ -2103,12 +2141,11 @@ ExecAgg(PlanState *pstate)
case AGG_HASHED:
if (!node->table_filled)
agg_fill_hash_table(node);
- /* FALLTHROUGH */
- case AGG_MIXED:
result = agg_retrieve_hash_table(node);
break;
case AGG_PLAIN:
case AGG_SORTED:
+ case AGG_MIXED:
if (!node->input_sorted)
agg_sort_input(node);
result = agg_retrieve_direct(node);
@@ -2136,8 +2173,8 @@ agg_retrieve_direct(AggState *aggstate)
TupleTableSlot *outerslot;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- bool hasGroupingSets = aggstate->phase->numsets > 0;
- int numGroupingSets = Max(aggstate->phase->numsets, 1);
+ bool hasGroupingSets = aggstate->phase->aggnode->groupingSets != NULL;
+ int numGroupingSets = aggstate->phase->numsets;
int currentSet;
int nextSetSize;
int numReset;
@@ -2154,7 +2191,7 @@ agg_retrieve_direct(AggState *aggstate)
tmpcontext = aggstate->tmpcontext;
peragg = aggstate->peragg;
- pergroups = aggstate->pergroups;
+ pergroups = aggstate->phase->pergroups;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
/*
@@ -2212,25 +2249,35 @@ agg_retrieve_direct(AggState *aggstate)
{
if (aggstate->current_phase < aggstate->numphases - 1)
{
+ /* Advance to the next phase */
initialize_phase(aggstate, aggstate->current_phase + 1);
- aggstate->input_done = false;
- aggstate->projected_set = -1;
- numGroupingSets = Max(aggstate->phase->numsets, 1);
- node = aggstate->phase->aggnode;
- numReset = numGroupingSets;
- }
- else if (aggstate->aggstrategy == AGG_MIXED)
- {
- /*
- * Mixed mode; we've output all the grouped stuff and have
- * full hashtables, so switch to outputting those.
- */
- initialize_phase(aggstate, 0);
- aggstate->table_filled = true;
- ResetTupleHashIterator(aggstate->perhash[0].hashtable,
- &aggstate->perhash[0].hashiter);
- select_current_set(aggstate, 0, true);
- return agg_retrieve_hash_table(aggstate);
+
+ /* Check whether new phase is an AGG_HASHED */
+ if (!aggstate->phase->is_hashed)
+ {
+ aggstate->input_done = false;
+ aggstate->projected_set = -1;
+ numGroupingSets = aggstate->phase->numsets;
+ node = aggstate->phase->aggnode;
+ numReset = numGroupingSets;
+ pergroups = aggstate->phase->pergroups;
+ }
+ else
+ {
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) aggstate->phase;
+ /* finalize any spills */
+ hashagg_finish_initial_spills(aggstate);
+
+
+ /*
+ * Mixed mode; we've output all the grouped stuff and have
+ * full hashtables, so switch to outputting those.
+ */
+ aggstate->table_filled = true;
+ ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
+ select_current_set(aggstate, 0, true);
+ return agg_retrieve_hash_table(aggstate);
+ }
}
else
{
@@ -2269,11 +2316,11 @@ agg_retrieve_direct(AggState *aggstate)
*/
tmpcontext->ecxt_innertuple = econtext->ecxt_outertuple;
if (aggstate->input_done ||
- (node->aggstrategy != AGG_PLAIN &&
+ (aggstate->phase->aggnode->numCols > 0 &&
aggstate->projected_set != -1 &&
aggstate->projected_set < (numGroupingSets - 1) &&
nextSetSize > 0 &&
- !ExecQualAndReset(aggstate->phase->eqfunctions[nextSetSize - 1],
+ !ExecQualAndReset(((AggStatePerPhaseSort) aggstate->phase)->eqfunctions[nextSetSize - 1],
tmpcontext)))
{
aggstate->projected_set += 1;
@@ -2376,13 +2423,13 @@ agg_retrieve_direct(AggState *aggstate)
for (;;)
{
/*
- * During phase 1 only of a mixed agg, we need to update
- * hashtables as well in advance_aggregates.
+ * If the current phase can do transitions concurrently, we need
+ * to update the hashtables as well in advance_aggregates.
*/
- if (aggstate->aggstrategy == AGG_MIXED &&
- aggstate->current_phase == 1)
+ if (aggstate->phase->concurrent_hashes)
{
- lookup_hash_entries(aggstate);
+ lookup_hash_entries(aggstate, NULL,
+ aggstate->phase->concurrent_hashes);
}
/* Advance the aggregates (or combine functions) */
@@ -2396,11 +2443,6 @@ agg_retrieve_direct(AggState *aggstate)
{
/* no more outer-plan tuples available */
- /* if we built hash tables, finalize any spills */
- if (aggstate->aggstrategy == AGG_MIXED &&
- aggstate->current_phase == 1)
- hashagg_finish_initial_spills(aggstate);
-
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -2419,10 +2461,10 @@ agg_retrieve_direct(AggState *aggstate)
* If we are grouping, check whether we've crossed a group
* boundary.
*/
- if (node->aggstrategy != AGG_PLAIN)
+ if (aggstate->phase->aggnode->numCols > 0)
{
tmpcontext->ecxt_innertuple = firstSlot;
- if (!ExecQual(aggstate->phase->eqfunctions[node->numCols - 1],
+ if (!ExecQual(((AggStatePerPhaseSort) aggstate->phase)->eqfunctions[node->numCols - 1],
tmpcontext))
{
aggstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
@@ -2471,24 +2513,31 @@ agg_retrieve_direct(AggState *aggstate)
static void
agg_sort_input(AggState *aggstate)
{
- AggStatePerPhase phase = &aggstate->phases[1];
+ AggStatePerPhase phase = aggstate->phases[0];
+ AggStatePerPhaseSort persort = (AggStatePerPhaseSort) phase;
TupleDesc tupDesc;
Sort *sortnode;
+ bool randomAccess;
Assert(!aggstate->input_sorted);
+ Assert(!phase->is_hashed);
Assert(phase->aggnode->sortnode);
sortnode = (Sort *) phase->aggnode->sortnode;
tupDesc = ExecGetResultType(outerPlanState(aggstate));
-
- aggstate->sort_in = tuplesort_begin_heap(tupDesc,
- sortnode->numCols,
- sortnode->sortColIdx,
- sortnode->sortOperators,
- sortnode->collations,
- sortnode->nullsFirst,
- work_mem,
- NULL, false);
+ randomAccess = (aggstate->eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) != 0;
+
+
+ persort->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, randomAccess);
for (;;)
{
TupleTableSlot *outerslot;
@@ -2497,11 +2546,11 @@ agg_sort_input(AggState *aggstate)
if (TupIsNull(outerslot))
break;
- tuplesort_puttupleslot(aggstate->sort_in, outerslot);
+ tuplesort_puttupleslot(persort->sort_in, outerslot);
}
/* Sort the first phase */
- tuplesort_performsort(aggstate->sort_in);
+ tuplesort_performsort(persort->sort_in);
/* Mark the input to be sorted */
aggstate->input_sorted = true;
@@ -2513,8 +2562,14 @@ agg_sort_input(AggState *aggstate)
static void
agg_fill_hash_table(AggState *aggstate)
{
+ AggStatePerPhaseHash currentphase;
TupleTableSlot *outerslot;
ExprContext *tmpcontext = aggstate->tmpcontext;
+ List *concurrent_hashes = aggstate->phase->concurrent_hashes;
+
+ /* Current phase must be the first phase */
+ Assert(aggstate->current_phase == 0);
+ currentphase = (AggStatePerPhaseHash) aggstate->phase;
/*
* Process each outer-plan tuple, and then fetch the next one, until we
@@ -2522,7 +2577,7 @@ agg_fill_hash_table(AggState *aggstate)
*/
for (;;)
{
- outerslot = fetch_input_tuple(aggstate);
+ outerslot = ExecProcNode(outerPlanState(aggstate));
if (TupIsNull(outerslot))
break;
@@ -2530,7 +2585,7 @@ agg_fill_hash_table(AggState *aggstate)
tmpcontext->ecxt_outertuple = outerslot;
/* Find or build hashtable entries */
- lookup_hash_entries(aggstate);
+ lookup_hash_entries(aggstate, currentphase, concurrent_hashes);
/* Advance the aggregates (or combine functions) */
advance_aggregates(aggstate);
@@ -2548,8 +2603,7 @@ agg_fill_hash_table(AggState *aggstate)
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
- ResetTupleHashIterator(aggstate->perhash[0].hashtable,
- &aggstate->perhash[0].hashiter);
+ ResetTupleHashIterator(currentphase->hashtable, &currentphase->hashiter);
}
/*
@@ -2567,6 +2621,7 @@ agg_fill_hash_table(AggState *aggstate)
static bool
agg_refill_hash_table(AggState *aggstate)
{
+ AggStatePerPhaseHash perhash;
HashAggBatch *batch;
HashAggSpill spill;
HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
@@ -2578,6 +2633,7 @@ agg_refill_hash_table(AggState *aggstate)
batch = linitial(aggstate->hash_batches);
aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+ perhash = (AggStatePerPhaseHash) aggstate->phases[batch->phaseidx];
/*
* Estimate the number of groups for this batch as the total number of
@@ -2592,32 +2648,15 @@ agg_refill_hash_table(AggState *aggstate)
batch->used_bits, &aggstate->hash_mem_limit,
&aggstate->hash_ngroups_limit, NULL);
- /* there could be residual pergroup pointers; clear them */
- for (int setoff = 0;
- setoff < aggstate->maxsets + aggstate->num_hashes;
- setoff++)
- aggstate->all_pergroups[setoff] = NULL;
-
/* free memory and reset hash tables */
ReScanExprContext(aggstate->hashcontext);
- for (int setno = 0; setno < aggstate->num_hashes; setno++)
- ResetTupleHashTable(aggstate->perhash[setno].hashtable);
+ ResetTupleHashTable(perhash->hashtable);
aggstate->hash_ngroups_current = 0;
- /*
- * In AGG_MIXED mode, hash aggregation happens in phase 1 and the output
- * happens in phase 0. So, we switch to phase 1 when processing a batch,
- * and back to phase 0 after the batch is done.
- */
- Assert(aggstate->current_phase == 0);
- if (aggstate->phase->aggstrategy == AGG_MIXED)
- {
- aggstate->current_phase = 1;
- aggstate->phase = &aggstate->phases[aggstate->current_phase];
- }
-
- select_current_set(aggstate, batch->setno, true);
+ /* switch to the phase of the current batch */
+ initialize_phase(aggstate, batch->phaseidx);
+ select_current_set(aggstate, 0, true);
/*
* Spilled tuples are always read back as MinimalTuples, which may be
@@ -2626,7 +2665,8 @@ agg_refill_hash_table(AggState *aggstate)
* We still need the NULL check, because we are only processing one
* grouping set at a time and the rest will be NULL.
*/
- hashagg_recompile_expressions(aggstate, true, true);
+ aggstate->evaltrans_mode = HASHREFILLMODE;
+ hashagg_recompile_expressions(aggstate);
LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
HASHAGG_READ_BUFFER_SIZE);
@@ -2644,10 +2684,11 @@ agg_refill_hash_table(AggState *aggstate)
ExecStoreMinimalTuple(tuple, slot, true);
aggstate->tmpcontext->ecxt_outertuple = slot;
- prepare_hash_slot(aggstate);
- aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(aggstate, hash);
+ prepare_hash_slot(aggstate, perhash);
+ perhash->phasedata.pergroups[0] =
+ lookup_hash_entry(aggstate, perhash, hash);
- if (aggstate->hash_pergroup[batch->setno] != NULL)
+ if (perhash->phasedata.pergroups[0] != NULL)
{
/* Advance the aggregates (or combine functions) */
advance_aggregates(aggstate);
@@ -2677,14 +2718,10 @@ agg_refill_hash_table(AggState *aggstate)
hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
- /* change back to phase 0 */
- aggstate->current_phase = 0;
- aggstate->phase = &aggstate->phases[aggstate->current_phase];
-
if (spill_initialized)
{
hash_agg_update_metrics(aggstate, true, spill.npartitions);
- hashagg_spill_finish(aggstate, &spill, batch->setno);
+ hashagg_spill_finish(aggstate, &spill, batch->phaseidx);
}
else
hash_agg_update_metrics(aggstate, true, 0);
@@ -2692,9 +2729,7 @@ agg_refill_hash_table(AggState *aggstate)
aggstate->hash_spill_mode = false;
/* prepare to walk the first hash table */
- select_current_set(aggstate, batch->setno, true);
- ResetTupleHashIterator(aggstate->perhash[batch->setno].hashtable,
- &aggstate->perhash[batch->setno].hashiter);
+ ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
pfree(batch);
@@ -2742,7 +2777,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
TupleHashEntryData *entry;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- AggStatePerHash perhash;
+ AggStatePerPhaseHash perhash;
/*
* get state info from node.
@@ -2753,11 +2788,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
peragg = aggstate->peragg;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
- /*
- * Note that perhash (and therefore anything accessed through it) can
- * change inside the loop, as we change between grouping sets.
- */
- perhash = &aggstate->perhash[aggstate->current_set];
+ perhash = (AggStatePerPhaseHash) aggstate->phase;
/*
* We loop retrieving groups until we find one satisfying
@@ -2776,18 +2807,16 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
entry = ScanTupleHashTable(perhash->hashtable, &perhash->hashiter);
if (entry == NULL)
{
- int nextset = aggstate->current_set + 1;
-
- if (nextset < aggstate->num_hashes)
+ if (aggstate->current_phase + 1 < aggstate->numphases &&
+ aggstate->evaltrans_mode != HASHREFILLMODE)
{
/*
* Switch to next grouping set, reinitialize, and restart the
* loop.
*/
- select_current_set(aggstate, nextset, true);
-
- perhash = &aggstate->perhash[aggstate->current_set];
-
+ select_current_set(aggstate, 0, true);
+ initialize_phase(aggstate, aggstate->current_phase + 1);
+ perhash = (AggStatePerPhaseHash) aggstate->phase;
ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
continue;
@@ -2982,12 +3011,12 @@ hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
* be done.
*/
static HashAggBatch *
-hashagg_batch_new(LogicalTapeSet *tapeset, int tapenum, int setno,
+hashagg_batch_new(LogicalTapeSet *tapeset, int tapenum, int phaseidx,
int64 input_tuples, int used_bits)
{
HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
- batch->setno = setno;
+ batch->phaseidx = phaseidx;
batch->used_bits = used_bits;
batch->tapeset = tapeset;
batch->input_tapenum = tapenum;
@@ -3053,25 +3082,31 @@ hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
static void
hashagg_finish_initial_spills(AggState *aggstate)
{
- int setno;
+ int phaseidx;
int total_npartitions = 0;
- if (aggstate->hash_spills != NULL)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- for (setno = 0; setno < aggstate->num_hashes; setno++)
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
+
+ if (!phase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) phase;
+ if (perhash->hash_spill)
{
- HashAggSpill *spill = &aggstate->hash_spills[setno];
- total_npartitions += spill->npartitions;
- hashagg_spill_finish(aggstate, spill, setno);
- }
+ total_npartitions += perhash->hash_spill->npartitions;
+ hashagg_spill_finish(aggstate, perhash->hash_spill, phase->phaseidx);
- /*
- * We're not processing tuples from outer plan any more; only
- * processing batches of spilled tuples. The initial spill structures
- * are no longer needed.
- */
- pfree(aggstate->hash_spills);
- aggstate->hash_spills = NULL;
+ /*
+ * We're not processing tuples from outer plan any more; only
+ * processing batches of spilled tuples. The initial spill structures
+ * are no longer needed.
+ */
+ pfree(perhash->hash_spill);
+ perhash->hash_spill = NULL;
+ }
}
hash_agg_update_metrics(aggstate, false, total_npartitions);
@@ -3084,7 +3119,7 @@ hashagg_finish_initial_spills(AggState *aggstate)
* Transform spill partitions into new batches.
*/
static void
-hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int phaseidx)
{
int i;
int used_bits = 32 - spill->shift;
@@ -3102,7 +3137,7 @@ hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
continue;
new_batch = hashagg_batch_new(aggstate->hash_tapeinfo->tapeset,
- tapenum, setno, spill->ntuples[i],
+ tapenum, phaseidx, spill->ntuples[i],
used_bits);
aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
aggstate->hash_batches_used++;
@@ -3118,21 +3153,25 @@ hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
static void
hashagg_reset_spill_state(AggState *aggstate)
{
- ListCell *lc;
+ ListCell *lc;
+ int phaseidx;
/* free spills from initial pass */
- if (aggstate->hash_spills != NULL)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- int setno;
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
- for (setno = 0; setno < aggstate->num_hashes; setno++)
+ if (!phase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) phase;
+
+ if (perhash->hash_spill)
{
- HashAggSpill *spill = &aggstate->hash_spills[setno];
- pfree(spill->ntuples);
- pfree(spill->partitions);
+ pfree(perhash->hash_spill);
+ perhash->hash_spill = NULL;
}
- pfree(aggstate->hash_spills);
- aggstate->hash_spills = NULL;
}
/* free batches */
@@ -3171,25 +3210,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
AggState *aggstate;
AggStatePerAgg peraggs;
AggStatePerTrans pertransstates;
- AggStatePerGroup *pergroups;
Plan *outerPlan;
ExprContext *econtext;
TupleDesc scanDesc;
- Agg *firstSortAgg;
int numaggs,
transno,
aggno;
- int phase;
int phaseidx;
ListCell *l;
Bitmapset *all_grouped_cols = NULL;
int numGroupingSets = 1;
- int numPhases;
- int numHashes;
int i = 0;
int j = 0;
+ bool need_extra_slot = false;
bool use_hashing = (node->aggstrategy == AGG_HASHED ||
node->aggstrategy == AGG_MIXED);
+ uint64 totalHashGroups = 0;
/* check for unsupported flags */
Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -3216,24 +3252,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->curpertrans = NULL;
aggstate->input_done = false;
aggstate->agg_done = false;
- aggstate->pergroups = NULL;
aggstate->grp_firstTuple = NULL;
- aggstate->sort_in = NULL;
- aggstate->sort_out = NULL;
aggstate->input_sorted = true;
-
- /*
- * phases[0] always exists, but is dummy in sorted/plain mode
- */
- numPhases = (use_hashing ? 1 : 2);
- numHashes = (use_hashing ? 1 : 0);
-
- firstSortAgg = node->aggstrategy == AGG_SORTED ? node : NULL;
+ aggstate->eflags = eflags;
+ aggstate->num_hashes = 0;
+ aggstate->evaltrans_mode = HASHNORMALMODE;
/*
* Calculate the maximum number of grouping sets in any phase; this
- * determines the size of some allocations. Also calculate the number of
- * phases, since all hashed/mixed nodes contribute to only a single phase.
+ * determines the size of some allocations.
*/
if (node->groupingSets)
{
@@ -3246,31 +3273,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numGroupingSets = Max(numGroupingSets,
list_length(agg->groupingSets));
- /*
- * additional AGG_HASHED aggs become part of phase 0, but all
- * others add an extra phase.
- */
if (agg->aggstrategy != AGG_HASHED)
- {
- ++numPhases;
-
- if (!firstSortAgg)
- firstSortAgg = agg;
-
- }
- else
- ++numHashes;
+ need_extra_slot = true;
}
}
aggstate->maxsets = numGroupingSets;
- aggstate->numphases = numPhases;
+ aggstate->numphases = 1 + list_length(node->chain);
/*
- * The first SORTED phase is not sorted, agg need to do its own sort. See
+ * The first phase is not sorted; the agg needs to do its own sort. See
* agg_sort_input(), this can only happen in groupingsets case.
*/
- if (firstSortAgg && firstSortAgg->sortnode)
+ if (node->sortnode)
aggstate->input_sorted = false;
aggstate->aggcontexts = (ExprContext **)
@@ -3331,11 +3346,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
scanDesc = aggstate->ss.ss_ScanTupleSlot->tts_tupleDescriptor;
/*
- * If there are more than two phases (including a potential dummy phase
- * 0), input will be resorted using tuplesort. Need a slot for that.
+ * An extra slot is needed if 1) the agg needs to do its own sort, or
+ * 2) the agg has more than one non-hashed phase.
*/
- if (numPhases > 2 ||
- !aggstate->input_sorted)
+ if (node->sortnode || need_extra_slot)
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -3391,72 +3405,92 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* For each phase, prepare grouping set data and fmgr lookup data for
* compare functions. Accumulate all_grouped_cols in passing.
*/
- aggstate->phases = palloc0(numPhases * sizeof(AggStatePerPhaseData));
+ aggstate->phases = palloc0(aggstate->numphases * sizeof(AggStatePerPhase));
- aggstate->num_hashes = numHashes;
- if (numHashes)
- {
- aggstate->perhash = palloc0(sizeof(AggStatePerHashData) * numHashes);
- aggstate->phases[0].numsets = 0;
- aggstate->phases[0].gset_lengths = palloc(numHashes * sizeof(int));
- aggstate->phases[0].grouped_cols = palloc(numHashes * sizeof(Bitmapset *));
- }
-
- phase = 0;
- for (phaseidx = 0; phaseidx <= list_length(node->chain); ++phaseidx)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
Agg *aggnode;
+ AggStatePerPhase phasedata = NULL;
if (phaseidx > 0)
aggnode = list_nth_node(Agg, node->chain, phaseidx - 1);
else
aggnode = node;
- if (aggnode->aggstrategy == AGG_HASHED
- || aggnode->aggstrategy == AGG_MIXED)
+ if (aggnode->aggstrategy == AGG_HASHED)
{
- AggStatePerPhase phasedata = &aggstate->phases[0];
- AggStatePerHash perhash;
- Bitmapset *cols = NULL;
+ AggStatePerPhaseHash perhash;
+ Bitmapset *cols = NULL;
- Assert(phase == 0);
- i = phasedata->numsets++;
- perhash = &aggstate->perhash[i];
+ aggstate->num_hashes++;
+ totalHashGroups += aggnode->numGroups;
- /* phase 0 always points to the "real" Agg in the hash case */
- phasedata->aggnode = node;
- phasedata->aggstrategy = node->aggstrategy;
-
- /* but the actual Agg node representing this hash is saved here */
- perhash->aggnode = aggnode;
+ perhash = (AggStatePerPhaseHash) palloc0(sizeof(AggStatePerPhaseHashData));
+ phasedata = (AggStatePerPhase) perhash;
+ phasedata->is_hashed = true;
+ phasedata->aggnode = aggnode;
+ phasedata->aggstrategy = aggnode->aggstrategy;
- phasedata->gset_lengths[i] = perhash->numCols = aggnode->numCols;
+ /* AGG_HASHED always has only one set */
+ phasedata->numsets = 1;
+ phasedata->gset_lengths = palloc(sizeof(int));
+ phasedata->gset_lengths[0] = perhash->numCols = aggnode->numCols;
+ phasedata->grouped_cols = palloc(sizeof(Bitmapset *));
for (j = 0; j < aggnode->numCols; ++j)
cols = bms_add_member(cols, aggnode->grpColIdx[j]);
-
- phasedata->grouped_cols[i] = cols;
+ phasedata->grouped_cols[0] = cols;
all_grouped_cols = bms_add_members(all_grouped_cols, cols);
- continue;
+
+ /*
+ * Initialize the pergroup state. For AGG_HASHED, all groups do their
+ * transitions on the fly and all pergroup states are kept in the
+ * hashtable. Each time a tuple is processed, lookup_hash_entry()
+ * chooses one group and sets phasedata->pergroups[0], which
+ * advance_aggregates then uses to do the transition for that group.
+ * We need not allocate a real pergroup and set the pointer here;
+ * there are too many pergroup states, so lookup_hash_entry() will
+ * allocate them on demand.
+ */
+ phasedata->pergroups =
+ (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup));
+
+ /*
+ * Hash aggregation does not depend on the order of the input tuples,
+ * so we can do the transition immediately when a tuple is fetched,
+ * which means the transition can run concurrently with the first
+ * phase.
+ */
+ if (phaseidx > 0)
+ {
+ aggstate->phases[0]->concurrent_hashes =
+ lappend(aggstate->phases[0]->concurrent_hashes, perhash);
+ /* skip evaltrans for this phase */
+ phasedata->skip_evaltrans = true;
+ }
}
else
{
- AggStatePerPhase phasedata = &aggstate->phases[++phase];
- int num_sets;
+ AggStatePerPhaseSort persort;
- phasedata->numsets = num_sets = list_length(aggnode->groupingSets);
+ persort = (AggStatePerPhaseSort) palloc0(sizeof(AggStatePerPhaseSortData));
+ phasedata = (AggStatePerPhase) persort;
+ phasedata->is_hashed = false;
+ phasedata->aggnode = aggnode;
+ phasedata->aggstrategy = aggnode->aggstrategy;
- if (num_sets)
+ if (aggnode->groupingSets)
{
- phasedata->gset_lengths = palloc(num_sets * sizeof(int));
- phasedata->grouped_cols = palloc(num_sets * sizeof(Bitmapset *));
+ phasedata->numsets = list_length(aggnode->groupingSets);
+ phasedata->gset_lengths = palloc(phasedata->numsets * sizeof(int));
+ phasedata->grouped_cols = palloc(phasedata->numsets * sizeof(Bitmapset *));
i = 0;
foreach(l, aggnode->groupingSets)
{
- int current_length = list_length(lfirst(l));
- Bitmapset *cols = NULL;
+ int current_length = list_length(lfirst(l));
+ Bitmapset *cols = NULL;
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -3473,37 +3507,49 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
else
{
- Assert(phaseidx == 0);
-
+ phasedata->numsets = 1;
phasedata->gset_lengths = NULL;
phasedata->grouped_cols = NULL;
}
+ /*
+ * Initialize pergroup states for AGG_SORTED/AGG_PLAIN/AGG_MIXED
+ * phases. Each set has only one group on the fly, so all groups in
+ * a set can reuse one pergroup state. Unlike AGG_HASHED, we
+ * pre-allocate the pergroup states here.
+ */
+ phasedata->pergroups =
+ (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup) * phasedata->numsets);
+
+ for (i = 0; i < phasedata->numsets; i++)
+ {
+ phasedata->pergroups[i] =
+ (AggStatePerGroup) palloc0(sizeof(AggStatePerGroupData) * numaggs);
+ }
+
/*
* If we are grouping, precompute fmgr lookup data for inner loop.
*/
- if (aggnode->aggstrategy == AGG_SORTED)
+ if (aggnode->numCols > 0)
{
int i = 0;
- Assert(aggnode->numCols > 0);
-
/*
* Build a separate function for each subset of columns that
* need to be compared.
*/
- phasedata->eqfunctions =
+ persort->eqfunctions =
(ExprState **) palloc0(aggnode->numCols * sizeof(ExprState *));
/* for each grouping set */
- for (i = 0; i < phasedata->numsets; i++)
+ for (i = 0; i < phasedata->numsets && phasedata->gset_lengths; i++)
{
int length = phasedata->gset_lengths[i];
- if (phasedata->eqfunctions[length - 1] != NULL)
+ if (persort->eqfunctions[length - 1] != NULL)
continue;
- phasedata->eqfunctions[length - 1] =
+ persort->eqfunctions[length - 1] =
execTuplesMatchPrepare(scanDesc,
length,
aggnode->grpColIdx,
@@ -3513,9 +3559,9 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
/* and for all grouped columns, unless already computed */
- if (phasedata->eqfunctions[aggnode->numCols - 1] == NULL)
+ if (persort->eqfunctions[aggnode->numCols - 1] == NULL)
{
- phasedata->eqfunctions[aggnode->numCols - 1] =
+ persort->eqfunctions[aggnode->numCols - 1] =
execTuplesMatchPrepare(scanDesc,
aggnode->numCols,
aggnode->grpColIdx,
@@ -3525,9 +3571,24 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
}
- phasedata->aggnode = aggnode;
- phasedata->aggstrategy = aggnode->aggstrategy;
+ /*
+ * A non-first AGG_SORTED phase processes the same input tuples as
+ * the previous phase, except that it needs to re-sort them. Tell
+ * the previous phase to copy out the tuples.
+ */
+ if (phaseidx > 0)
+ {
+ AggStatePerPhaseSort prev =
+ (AggStatePerPhaseSort) aggstate->phases[phaseidx - 1];
+
+ Assert(!prev->phasedata.is_hashed);
+ /* Tell the previous phase to copy the tuple to the sort_in */
+ prev->copy_out = true;
+ }
}
+
+ phasedata->phaseidx = phaseidx;
+ aggstate->phases[phaseidx] = phasedata;
}
/*
@@ -3551,51 +3612,9 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->peragg = peraggs;
aggstate->pertrans = pertransstates;
-
- aggstate->all_pergroups =
- (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup)
- * (numGroupingSets + numHashes));
- pergroups = aggstate->all_pergroups;
-
- if (node->aggstrategy != AGG_HASHED)
- {
- for (i = 0; i < numGroupingSets; i++)
- {
- pergroups[i] = (AggStatePerGroup) palloc0(sizeof(AggStatePerGroupData)
- * numaggs);
- }
-
- aggstate->pergroups = pergroups;
- pergroups += numGroupingSets;
- }
-
- /*
- * Hashing can only appear in the initial phase.
- */
- if (use_hashing)
- {
- /* this is an array of pointers, not structures */
- aggstate->hash_pergroup = pergroups;
- }
-
- /*
- * Initialize current phase-dependent values to initial phase. The initial
- * phase is 1 (first sort pass) for all strategies that use sorting (if
- * hashing is being done too, then phase 0 is processed last); but if only
- * hashing is being done, then phase 0 is all there is.
- */
- if (node->aggstrategy == AGG_HASHED)
- {
- aggstate->current_phase = 0;
- initialize_phase(aggstate, 0);
- select_current_set(aggstate, 0, true);
- }
- else
- {
- aggstate->current_phase = 1;
- initialize_phase(aggstate, 1);
- select_current_set(aggstate, 0, false);
- }
+ aggstate->current_phase = 0;
+ initialize_phase(aggstate, 0);
+ select_current_set(aggstate, 0, aggstate->aggstrategy == AGG_HASHED);
/* -----------------
* Perform lookups of aggregate function info, and initialize the
@@ -3931,12 +3950,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
- /* Initialize hash contexts and hash tables for hash aggregates */
+ /* Initialize hash contexts and hash tables for hash aggregates */
if (use_hashing)
{
Plan *outerplan = outerPlan(node);
- uint64 totalGroups = 0;
- int i;
aggstate->hash_metacxt = AllocSetContextCreate(
aggstate->ss.ps.state->es_query_cxt,
@@ -3954,10 +3973,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* when there is more than one grouping set, but should still be
* reasonable.
*/
- for (i = 0; i < aggstate->num_hashes; i++)
- totalGroups = aggstate->perhash[i].aggnode->numGroups;
-
- hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ hash_agg_set_limits(aggstate->hashentrysize, totalHashGroups, 0,
&aggstate->hash_mem_limit,
&aggstate->hash_ngroups_limit,
&aggstate->hash_planned_partitions);
@@ -3976,51 +3992,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- AggStatePerPhase phase = &aggstate->phases[phaseidx];
- bool dohash = false;
- bool dosort = false;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
- /* phase 0 doesn't necessarily exist */
- if (!phase->aggnode)
+ if (phase->skip_evaltrans)
continue;
- if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 1)
- {
- /*
- * Phase one, and only phase one, in a mixed agg performs both
- * sorting and aggregation.
- */
- dohash = true;
- dosort = true;
- }
- else if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 0)
- {
- /*
- * No need to compute a transition function for an AGG_MIXED phase
- * 0 - the contents of the hashtables will have been computed
- * during phase 1.
- */
- continue;
- }
- else if (phase->aggstrategy == AGG_PLAIN ||
- phase->aggstrategy == AGG_SORTED)
- {
- dohash = false;
- dosort = true;
- }
- else if (phase->aggstrategy == AGG_HASHED)
- {
- dohash = true;
- dosort = false;
- }
- else
- Assert(false);
-
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
- false);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, false, true);
/* cache compiled expression for outer slot without NULL check */
- phase->evaltrans_cache[0][0] = phase->evaltrans;
+ phase->evaltrans_cache[HASHNORMALMODE] = phase->evaltrans;
}
return aggstate;
@@ -4506,13 +4486,21 @@ ExecEndAgg(AggState *node)
int transno;
int numGroupingSets = Max(node->maxsets, 1);
int setno;
+ int phaseidx;
/* Make sure we have closed any open tuplesorts */
+ for (phaseidx = 0; phaseidx < node->numphases; phaseidx++)
+ {
+ AggStatePerPhase phase = node->phases[phaseidx];
+ AggStatePerPhaseSort persort;
- if (node->sort_in)
- tuplesort_end(node->sort_in);
- if (node->sort_out)
- tuplesort_end(node->sort_out);
+ if (phase->is_hashed)
+ continue;
+
+ persort = (AggStatePerPhaseSort) phase;
+ if (persort->sort_in)
+ tuplesort_end(persort->sort_in);
+ }
hashagg_reset_spill_state(node);
@@ -4562,6 +4550,7 @@ ExecReScanAgg(AggState *node)
int transno;
int numGroupingSets = Max(node->maxsets, 1);
int setno;
+ int phaseidx;
node->agg_done = false;
@@ -4586,8 +4575,12 @@ ExecReScanAgg(AggState *node)
if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
- ResetTupleHashIterator(node->perhash[0].hashtable,
- &node->perhash[0].hashiter);
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) node->phases[0];
+ ResetTupleHashIterator(perhash->hashtable,
+ &perhash->hashiter);
+
+ /* reset to phase 0 */
+ initialize_phase(node, 0);
select_current_set(node, 0, true);
return;
}
@@ -4652,7 +4645,8 @@ ExecReScanAgg(AggState *node)
node->table_filled = false;
/* iterator will be reset when the table is filled */
- hashagg_recompile_expressions(node, false, false);
+ node->hash_spill_mode = HASHNORMALMODE;
+ hashagg_recompile_expressions(node);
}
if (node->aggstrategy != AGG_HASHED)
@@ -4660,18 +4654,54 @@ ExecReScanAgg(AggState *node)
/*
* Reset the per-group state (in particular, mark transvalues null)
*/
- for (setno = 0; setno < numGroupingSets; setno++)
+ for (phaseidx = 0; phaseidx < node->numphases; phaseidx++)
{
- MemSet(node->pergroups[setno], 0,
- sizeof(AggStatePerGroupData) * node->numaggs);
+ AggStatePerPhase phase = node->phases[phaseidx];
+
+ /* hash pergroups is reset by build_hash_tables */
+ if (phase->is_hashed)
+ continue;
+
+ for (setno = 0; setno < phase->numsets; setno++)
+ MemSet(phase->pergroups[setno], 0,
+ sizeof(AggStatePerGroupData) * node->numaggs);
}
- /* Reset input_sorted */
+ /*
+ * If the agg did its own first sort using a tuplesort and that
+ * tuplesort was kept (see initialize_phase), the subplan does
+ * not have any parameter changes, and none of our own parameter
+ * changes affect input expressions of the aggregated functions,
+ * then we can just rescan the first tuplesort; there is no need
+ * to build it again.
+ *
+ * Note: the agg only does its own sort for grouping sets now.
+ */
if (aggnode->sortnode)
- node->input_sorted = false;
+ {
+ AggStatePerPhaseSort firstphase = (AggStatePerPhaseSort) node->phases[0];
+ bool randomAccess = (node->eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) != 0;
+ if (firstphase->sort_in &&
+ randomAccess &&
+ outerPlan->chgParam == NULL &&
+ !bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
+ {
+ tuplesort_rescan(firstphase->sort_in);
+ node->input_sorted = true;
+ }
+ else
+ {
+ if (firstphase->sort_in)
+ tuplesort_end(firstphase->sort_in);
+ firstphase->sort_in = NULL;
+ node->input_sorted = false;
+ }
+ }
- /* reset to phase 1 */
- initialize_phase(node, 1);
+ /* reset to phase 0 */
+ initialize_phase(node, 0);
node->input_done = false;
node->projected_set = -1;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index b855e73957..066cd59554 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2049,30 +2049,26 @@ llvm_compile_expr(ExprState *state)
case EEOP_AGG_PLAIN_PERGROUP_NULLCHECK:
{
int jumpnull;
- LLVMValueRef v_aggstatep;
- LLVMValueRef v_allpergroupsp;
+ LLVMValueRef v_pergroupsp;
LLVMValueRef v_pergroup_allaggs;
- LLVMValueRef v_setoff;
+ LLVMValueRef v_setno;
jumpnull = op->d.agg_plain_pergroup_nullcheck.jumpnull;
/*
- * pergroup_allaggs = aggstate->all_pergroups
- * [op->d.agg_plain_pergroup_nullcheck.setoff];
+ * pergroup =
+ * &op->d.agg_plain_pergroup_nullcheck.pergroups
+ * [op->d.agg_plain_pergroup_nullcheck.setno];
*/
- v_aggstatep = LLVMBuildBitCast(
- b, v_parent, l_ptr(StructAggState), "");
+ v_pergroupsp =
+ l_ptr_const(op->d.agg_plain_pergroup_nullcheck.pergroups,
+ l_ptr(l_ptr(StructAggStatePerGroupData)));
- v_allpergroupsp = l_load_struct_gep(
- b, v_aggstatep,
- FIELDNO_AGGSTATE_ALL_PERGROUPS,
- "aggstate.all_pergroups");
+ v_setno =
+ l_int32_const(op->d.agg_plain_pergroup_nullcheck.setno);
- v_setoff = l_int32_const(
- op->d.agg_plain_pergroup_nullcheck.setoff);
-
- v_pergroup_allaggs = l_load_gep1(
- b, v_allpergroupsp, v_setoff, "");
+ v_pergroup_allaggs =
+ l_load_gep1(b, v_pergroupsp, v_setno, "");
LLVMBuildCondBr(
b,
@@ -2094,6 +2090,7 @@ llvm_compile_expr(ExprState *state)
{
AggState *aggstate;
AggStatePerTrans pertrans;
+ AggStatePerGroup *pergroups;
FunctionCallInfo fcinfo;
LLVMValueRef v_aggstatep;
@@ -2103,12 +2100,12 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transvaluep;
LLVMValueRef v_transnullp;
- LLVMValueRef v_setoff;
+ LLVMValueRef v_setno;
LLVMValueRef v_transno;
LLVMValueRef v_aggcontext;
- LLVMValueRef v_allpergroupsp;
+ LLVMValueRef v_pergroupsp;
LLVMValueRef v_current_setp;
LLVMValueRef v_current_pertransp;
LLVMValueRef v_curaggcontext;
@@ -2124,6 +2121,7 @@ llvm_compile_expr(ExprState *state)
aggstate = castNode(AggState, state->parent);
pertrans = op->d.agg_trans.pertrans;
+ pergroups = op->d.agg_trans.pergroups;
fcinfo = pertrans->transfn_fcinfo;
@@ -2133,19 +2131,18 @@ llvm_compile_expr(ExprState *state)
l_ptr(StructAggStatePerTransData));
/*
- * pergroup = &aggstate->all_pergroups
- * [op->d.agg_strict_trans_check.setoff]
- * [op->d.agg_init_trans_check.transno];
+ * pergroup = &op->d.agg_trans.pergroups
+ * [op->d.agg_trans.setno]
+ * [op->d.agg_trans.transno];
*/
- v_allpergroupsp =
- l_load_struct_gep(b, v_aggstatep,
- FIELDNO_AGGSTATE_ALL_PERGROUPS,
- "aggstate.all_pergroups");
- v_setoff = l_int32_const(op->d.agg_trans.setoff);
+ v_pergroupsp =
+ l_ptr_const(pergroups,
+ l_ptr(l_ptr(StructAggStatePerGroupData)));
+ v_setno = l_int32_const(op->d.agg_trans.setno);
v_transno = l_int32_const(op->d.agg_trans.transno);
v_pergroupp =
LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
+ l_load_gep1(b, v_pergroupsp, v_setno, ""),
&v_transno, 1, "");
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7c29f89cc3..e9ad5a98cb 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2226,8 +2226,6 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
chain = NIL;
if (list_length(rollups) > 1)
{
- bool is_first_sort = ((RollupData *) linitial(rollups))->is_hashed;
-
for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst(lc);
@@ -2245,24 +2243,17 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
*/
if (!rollup->is_hashed)
{
- if (!is_first_sort ||
- (is_first_sort && !best_path->is_sorted))
- {
- sort_plan = (Plan *)
- make_sort_from_groupcols(rollup->groupClause,
- new_grpColIdx,
- subplan);
-
- /*
- * Remove stuff we don't need to avoid bloating debug output.
- */
- sort_plan->targetlist = NIL;
- sort_plan->lefttree = NULL;
- }
- }
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ new_grpColIdx,
+ subplan);
- if (!rollup->is_hashed)
- is_first_sort = false;
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
if (rollup->is_hashed)
strat = AGG_HASHED;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6578b3fef0..f26e962ac9 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4357,7 +4357,8 @@ consider_groupingsets_paths(PlannerInfo *root,
if (unhashed_rollup)
{
- new_rollups = lappend(new_rollups, unhashed_rollup);
+ /* unhashed rollups always sit before hashed rollups */
+ new_rollups = lcons(unhashed_rollup, new_rollups);
strat = AGG_MIXED;
}
else if (empty_sets)
@@ -4370,7 +4371,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = list_length(empty_sets);
rollup->hashable = false;
rollup->is_hashed = false;
- new_rollups = lappend(new_rollups, rollup);
+ /* unhashed rollups always sit before hashed rollups */
+ new_rollups = lcons(rollup, new_rollups);
/*
* The first non-hashed rollup is PLAIN AGG, is_sorted
* should be true.
@@ -4539,7 +4541,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = gs->numGroups;
rollup->hashable = true;
rollup->is_hashed = true;
- rollups = lcons(rollup, rollups);
+ /* non-hashed rollups always sit before hashed rollups */
+ rollups = lappend(rollups, rollup);
}
if (rollups)
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 6e8899227f..4578c3184b 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3001,7 +3001,6 @@ create_groupingsets_path(PlannerInfo *root,
PathTarget *target = rel->reltarget;
ListCell *lc;
bool is_first = true;
- bool is_first_sort = true;
/* The topmost generated Plan node will be an Agg */
pathnode->path.pathtype = T_Agg;
@@ -3054,14 +3053,13 @@ create_groupingsets_path(PlannerInfo *root,
int numGroupCols = list_length(linitial(gsets));
/*
- * In AGG_SORTED or AGG_PLAIN mode, the first rollup takes the
- * (already-sorted) input, and following ones do their own sort.
+ * In AGG_SORTED or AGG_PLAIN mode, the first rollup does its own
+ * sort if is_sorted is false; the following ones do their own sorts.
*
* In AGG_HASHED mode, there is one rollup for each grouping set.
*
- * In AGG_MIXED mode, the first rollups are hashed, the first
- * non-hashed one takes the (already-sorted) input, and following ones
- * do their own sort.
+ * In AGG_MIXED mode, the first rollup does its own sort if
+ * is_sorted is false; the following non-hashed ones do their own sorts.
*/
if (is_first)
{
@@ -3095,33 +3093,21 @@ create_groupingsets_path(PlannerInfo *root,
subpath->rows,
subpath->pathtarget->width);
is_first = false;
- if (!rollup->is_hashed)
- is_first_sort = false;
}
else
{
+ AggStrategy rollup_strategy;
Path sort_path; /* dummy for result of cost_sort */
Path agg_path; /* dummy for result of cost_agg */
- if (rollup->is_hashed || (is_first_sort && is_sorted))
- {
- /*
- * Account for cost of aggregation, but don't charge input
- * cost again
- */
- cost_agg(&agg_path, root,
- rollup->is_hashed ? AGG_HASHED : AGG_SORTED,
- agg_costs,
- numGroupCols,
- rollup->numGroups,
- having_qual,
- 0.0, 0.0,
- subpath->rows,
- subpath->pathtarget->width);
- if (!rollup->is_hashed)
- is_first_sort = false;
- }
- else
+ sort_path.startup_cost = 0;
+ sort_path.total_cost = 0;
+ sort_path.rows = subpath->rows;
+
+ rollup_strategy = rollup->is_hashed ?
+ AGG_HASHED : (numGroupCols ? AGG_SORTED : AGG_PLAIN);
+
+ if (!rollup->is_hashed && numGroupCols)
{
/* Account for cost of sort, but don't charge input cost again */
cost_sort(&sort_path, root, NIL,
@@ -3131,21 +3117,20 @@ create_groupingsets_path(PlannerInfo *root,
0.0,
work_mem,
-1.0);
-
- /* Account for cost of aggregation */
-
- cost_agg(&agg_path, root,
- AGG_SORTED,
- agg_costs,
- numGroupCols,
- rollup->numGroups,
- having_qual,
- sort_path.startup_cost,
- sort_path.total_cost,
- sort_path.rows,
- subpath->pathtarget->width);
}
+ /* Account for cost of aggregation */
+ cost_agg(&agg_path, root,
+ rollup_strategy,
+ agg_costs,
+ numGroupCols,
+ rollup->numGroups,
+ having_qual,
+ sort_path.startup_cost,
+ sort_path.total_cost,
+ sort_path.rows,
+ subpath->pathtarget->width);
+
pathnode->path.total_cost += agg_path.total_cost;
pathnode->path.rows += agg_path.rows;
}
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index dbe8649a57..4ed5d0a7de 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -626,7 +626,8 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_PLAIN_PERGROUP_NULLCHECK */
struct
{
- int setoff;
+ AggStatePerGroup *pergroups;
+ int setno;
int jumpnull;
} agg_plain_pergroup_nullcheck;
@@ -634,11 +635,11 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
struct
{
+ AggStatePerGroup *pergroups;
AggStatePerTrans pertrans;
ExprContext *aggcontext;
int setno;
int transno;
- int setoff;
} agg_trans;
} d;
} ExprEvalStep;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 94890512dc..d3a56c068e 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash, bool nullcheck);
+ bool nullcheck, bool allow_concurrent_hashing);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 9e70bd8b84..1612b71e05 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -270,21 +270,33 @@ typedef struct AggStatePerGroupData
*/
typedef struct AggStatePerPhaseData
{
+ int phaseidx; /* index of this phase in aggstate->phases */
+ bool is_hashed; /* plan to do hash aggregate */
AggStrategy aggstrategy; /* strategy for this phase */
- int numsets; /* number of grouping sets (or 0) */
+ int numsets; /* number of grouping sets */
int *gset_lengths; /* lengths of grouping sets */
Bitmapset **grouped_cols; /* column groupings for rollup */
- ExprState **eqfunctions; /* expression returning equality, indexed by
- * nr of cols to compare */
Agg *aggnode; /* Agg node for phase data */
ExprState *evaltrans; /* evaluation of transition functions */
-
/* cached variants of the compiled expression */
- ExprState *evaltrans_cache
- [2] /* 0: outerops; 1: TTSOpsMinimalTuple */
- [2]; /* 0: no NULL check; 1: with NULL check */
+ ExprState *evaltrans_cache[3];
+
+ List *concurrent_hashes; /* hash phases can do transition concurrently */
+ AggStatePerGroup *pergroups; /* pergroup states for a phase */
+
+ bool skip_evaltrans; /* do not build evaltrans */
} AggStatePerPhaseData;
+typedef struct AggStatePerPhaseSortData
+{
+ AggStatePerPhaseData phasedata;
+ Tuplesortstate *sort_in; /* sorted input to phases > 1 */
+ Tuplestorestate *store_in; /* sorted input to phases > 1 */
+ ExprState **eqfunctions; /* expression returning equality, indexed by
+ * nr of cols to compare */
+ bool copy_out; /* hint to copy input tuples out for the next phase */
+} AggStatePerPhaseSortData;
+
/*
* AggStatePerHashData - per-hashtable state
*
@@ -292,8 +304,9 @@ typedef struct AggStatePerPhaseData
* grouping set. (When doing hashing without grouping sets, we have just one of
* them.)
*/
-typedef struct AggStatePerHashData
+typedef struct AggStatePerPhaseHashData
{
+ AggStatePerPhaseData phasedata;
TupleHashTable hashtable; /* hash table with one entry per group */
TupleHashIterator hashiter; /* for iterating through hash table */
TupleTableSlot *hashslot; /* slot for loading hash table */
@@ -304,9 +317,8 @@ typedef struct AggStatePerHashData
int largestGrpColIdx; /* largest col required for hashing */
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
- Agg *aggnode; /* original Agg node, for numGroups etc. */
-} AggStatePerHashData;
-
+ struct HashAggSpill *hash_spill; /* HashAggSpill for current hash grouping set */
+} AggStatePerPhaseHashData;
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 75a45b2549..788ddade64 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2036,7 +2036,8 @@ typedef struct AggStatePerAggData *AggStatePerAgg;
typedef struct AggStatePerTransData *AggStatePerTrans;
typedef struct AggStatePerGroupData *AggStatePerGroup;
typedef struct AggStatePerPhaseData *AggStatePerPhase;
-typedef struct AggStatePerHashData *AggStatePerHash;
+typedef struct AggStatePerPhaseSortData *AggStatePerPhaseSort;
+typedef struct AggStatePerPhaseHashData *AggStatePerPhaseHash;
typedef struct AggState
{
@@ -2068,21 +2069,17 @@ typedef struct AggState
List *all_grouped_cols; /* list of all grouped cols in DESC order */
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
- AggStatePerPhase phases; /* array of all phases */
+ AggStatePerPhase *phases; /* array of all phases */
Tuplesortstate *sort_in; /* sorted input to phases > 1 */
Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
- AggStatePerGroup *pergroups; /* grouping set indexed array of per-group
- * pointers */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
- /* these fields are used in AGG_HASHED and AGG_MIXED modes: */
+ /* these fields are used in AGG_HASHED */
bool table_filled; /* hash table filled yet? */
int num_hashes;
MemoryContext hash_metacxt; /* memory for hash table itself */
struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
- struct HashAggSpill *hash_spills; /* HashAggSpill for each grouping set,
- exists only during first pass */
TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
List *hash_batches; /* hash batches remaining to be processed */
bool hash_ever_spilled; /* ever spilled during this execution? */
@@ -2098,18 +2095,16 @@ typedef struct AggState
memory in all hash tables */
uint64 hash_disk_used; /* kB of disk space used */
int hash_batches_used; /* batches used during entire execution */
-
- AggStatePerHash perhash; /* array of per-hashtable data */
- AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
- * per-group pointers */
+#define HASHNORMALMODE 0 /* normal mode: no minimal slot, no null check */
+#define HASHSPILLMODE 1 /* spill mode: no minimal slot, null check */
+#define HASHREFILLMODE 2 /* refill mode: minimal slot, no null check */
+ int evaltrans_mode;
/* these fields are used in AGG_SORTED and AGG_MIXED */
bool input_sorted; /* hash table filled yet? */
+ int eflags; /* eflags for the first sort */
+
- /* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 50
- AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
- * ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
} AggState;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 1acbbfad55..09f881d78a 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1004,10 +1004,10 @@ explain (costs off) select a, b, grouping(a,b), sum(v), count(*), max(v)
Sort
Sort Key: (GROUPING("*VALUES*".column1, "*VALUES*".column2)), "*VALUES*".column1, "*VALUES*".column2
-> MixedAggregate
+ Group Key: ()
Hash Key: "*VALUES*".column1, "*VALUES*".column2
Hash Key: "*VALUES*".column1
Hash Key: "*VALUES*".column2
- Group Key: ()
-> Values Scan on "*VALUES*"
(8 rows)
@@ -1066,9 +1066,9 @@ explain (costs off)
Sort
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
- Hash Key: unsortable_col
Sort Key: unhashable_col
Group Key: unhashable_col
+ Hash Key: unsortable_col
-> Seq Scan on gstest4
(7 rows)
@@ -1108,9 +1108,9 @@ explain (costs off)
Sort
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
- Hash Key: v, unsortable_col
Sort Key: v, unhashable_col
Group Key: v, unhashable_col
+ Hash Key: v, unsortable_col
-> Seq Scan on gstest4
(7 rows)
@@ -1149,10 +1149,10 @@ explain (costs off)
QUERY PLAN
--------------------------------
MixedAggregate
- Hash Key: a, b
Group Key: ()
Group Key: ()
Group Key: ()
+ Hash Key: a, b
-> Seq Scan on gstest_empty
(6 rows)
@@ -1310,10 +1310,10 @@ explain (costs off)
-> Sort
Sort Key: a, b
-> MixedAggregate
+ Group Key: ()
Hash Key: a, b
Hash Key: a
Hash Key: b
- Group Key: ()
-> Seq Scan on gstest2
(11 rows)
@@ -1345,10 +1345,10 @@ explain (costs off)
Sort
Sort Key: gstest_data.a, gstest_data.b
-> MixedAggregate
+ Group Key: ()
Hash Key: gstest_data.a, gstest_data.b
Hash Key: gstest_data.a
Hash Key: gstest_data.b
- Group Key: ()
-> Nested Loop
-> Values Scan on "*VALUES*"
-> Function Scan on gstest_data
@@ -1545,16 +1545,16 @@ explain (costs off)
QUERY PLAN
----------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
Sort Key: unique1
Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
Sort Key: thousand
Group Key: thousand
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(12 rows)
@@ -1567,12 +1567,12 @@ explain (costs off)
QUERY PLAN
-------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
Sort Key: unique1
Group Key: unique1
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(8 rows)
@@ -1586,15 +1586,15 @@ explain (costs off)
QUERY PLAN
----------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
- Hash Key: thousand
Sort Key: unique1
Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
+ Hash Key: thousand
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(11 rows)
@@ -1684,6 +1684,7 @@ group by cube (g1000,g100,g10);
QUERY PLAN
---------------------------------------------------
MixedAggregate
+ Group Key: ()
Hash Key: (g.g % 1000), (g.g % 100), (g.g % 10)
Hash Key: (g.g % 1000), (g.g % 100)
Hash Key: (g.g % 1000)
@@ -1691,7 +1692,6 @@ group by cube (g1000,g100,g10);
Hash Key: (g.g % 100)
Hash Key: (g.g % 10), (g.g % 1000)
Hash Key: (g.g % 10)
- Group Key: ()
-> Function Scan on generate_series g
(10 rows)
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3ac6c..7818f02032 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -340,8 +340,8 @@ SELECT c, sum(a) FROM pagg_tab GROUP BY rollup(c) ORDER BY 1, 2;
Sort
Sort Key: pagg_tab.c, (sum(pagg_tab.a))
-> MixedAggregate
- Hash Key: pagg_tab.c
Group Key: ()
+ Hash Key: pagg_tab.c
-> Append
-> Seq Scan on pagg_tab_p1 pagg_tab_1
-> Seq Scan on pagg_tab_p2 pagg_tab_2
--
2.14.1
Attachment: 0005-Parallel-grouping-sets.patch
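The preceding patch builds one equality comparator per distinct grouping-set prefix length, caching it in `eqfunctions[length - 1]` so that sets with the same number of grouping columns share a comparator, plus one for all grouped columns. A minimal Python model of that caching scheme (names and the closure-based comparators are illustrative, not the executor's API):

```python
# Model of the eqfunctions[length - 1] caching in ExecInitAgg (illustrative
# only): one comparator per distinct grouping-set prefix length, reused
# across grouping sets of the same length.

def build_eqfunctions(numCols, gset_lengths):
    """Return comparators indexed by (prefix length - 1), deduplicated."""
    eqfunctions = [None] * numCols
    built = 0  # how many comparators we actually construct

    def make_comparator(length):
        # Compare only the first `length` grouping columns of two tuples.
        return lambda a, b: a[:length] == b[:length]

    for length in gset_lengths:            # one entry per grouping set
        if length > 0 and eqfunctions[length - 1] is None:
            eqfunctions[length - 1] = make_comparator(length)
            built += 1
    # ... and for all grouped columns, unless already computed
    if eqfunctions[numCols - 1] is None:
        eqfunctions[numCols - 1] = make_comparator(numCols)
        built += 1
    return eqfunctions, built

# Grouping sets (c1,c2), (c1), (c2,c3): lengths 2, 1, 2 over 3 grouping cols.
eqfns, built = build_eqfunctions(3, [2, 1, 2])
```

The duplicate length 2 is served from the cache, so only three comparators are built for three grouping sets plus the full-column case.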
From 65937a343d7208c273ed7ff56659dc392f081fb9 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:08:11 -0400
Subject: [PATCH 5/5] Parallel grouping sets
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Grouping sets used to be computed in a single worker only; this patch adds support for parallel grouping sets using multiple workers.

The main idea is that, as with parallel aggregation, we split grouping sets into two stages:

The initial stage: this stage has almost the same plan and execution routines as the current implementation of grouping sets. The differences are that 1) it produces only partial aggregate results, and 2) an extra grouping set id is attached to the output. Partial aggregate results are combined in the final stage, and since there are multiple grouping sets, only partial results belonging to the same grouping set can be combined; the grouping set id is introduced to identify the sets. We keep all the multiple-grouping-sets optimizations in the initial stage, e.g. 1) grouping sets that can be grouped by one single sort are put into one rollup structure, so those sets are computed in one aggregate phase; 2) hash aggregation is done concurrently while a sort aggregate is performed; 3) all hash transitions are done in one expression state.

The final stage: this stage combines the partial aggregate results according to the grouping set id. Obviously, the optimizations of the initial stage cannot be used here, so all rollups are extracted so that each rollup contains only one grouping set and each aggregate phase processes only one set. The final stage applies a filter that redirects tuples to each aggregate phase.
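The two-stage scheme described above can be sketched as a toy model (Python, purely illustrative; `partial_stage`, `final_stage`, and the count(*) aggregate are assumptions for the sketch, not the patch's C routines):

```python
# Toy model of two-stage parallel grouping sets: each "worker" produces
# partial aggregates tagged with a grouping set id; the "leader" combines
# only partials that carry the same id.
from collections import defaultdict

GSETS = [("c1", "c2"), ("c1",)]          # grouping sets; id = list index

def partial_stage(rows):
    """Worker side: partial count(*) per (gsetid, grouping key)."""
    partials = defaultdict(int)
    for row in rows:
        for gsetid, cols in enumerate(GSETS):
            key = tuple(row[c] for c in cols)
            partials[(gsetid, key)] += 1
    return partials

def final_stage(all_partials):
    """Leader side: combine partials that share the same grouping set id."""
    result = defaultdict(int)
    for partials in all_partials:
        for (gsetid, key), n in partials.items():
            result[(gsetid, key)] += n   # only same-set partials combine
    return dict(result)

rows = [{"c1": 1, "c2": "a"}, {"c1": 1, "c2": "b"}, {"c1": 2, "c2": "a"}]
# two "workers" each see part of the input; the leader merges their partials
combined = final_stage([partial_stage(rows[:2]), partial_stage(rows[2:])])
```

Without the grouping set id, the partial count for group `c1=1` of set `(c1)` could not be told apart from the partial for group `(c1=1, c2='a')` of set `(c1, c2)` once both reach the leader.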
---
src/backend/commands/explain.c | 10 +-
src/backend/executor/execExpr.c | 10 +-
src/backend/executor/execExprInterp.c | 11 +
src/backend/executor/nodeAgg.c | 272 +++++++++++++++++++++--
src/backend/jit/llvm/llvmjit_expr.c | 40 ++++
src/backend/nodes/copyfuncs.c | 56 ++++-
src/backend/nodes/equalfuncs.c | 3 +
src/backend/nodes/nodeFuncs.c | 8 +
src/backend/nodes/outfuncs.c | 14 +-
src/backend/nodes/readfuncs.c | 53 ++++-
src/backend/optimizer/path/allpaths.c | 5 +-
src/backend/optimizer/plan/createplan.c | 25 +--
src/backend/optimizer/plan/planner.c | 334 ++++++++++++++++++++++-------
src/backend/optimizer/plan/setrefs.c | 16 ++
src/backend/optimizer/util/pathnode.c | 27 ++-
src/backend/utils/adt/ruleutils.c | 6 +
src/include/executor/execExpr.h | 1 +
src/include/executor/nodeAgg.h | 2 +
src/include/nodes/execnodes.h | 8 +-
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 2 +
src/include/nodes/plannodes.h | 4 +-
src/include/nodes/primnodes.h | 6 +
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/planmain.h | 2 +-
src/test/regress/expected/groupingsets.out | 112 ++++++++++
src/test/regress/sql/groupingsets.sql | 64 ++++++
27 files changed, 968 insertions(+), 125 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 4dec889f77..98bab2e639 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2258,12 +2258,16 @@ show_agg_keys(AggState *astate, List *ancestors,
{
Agg *plan = (Agg *) astate->ss.ps.plan;
- if (plan->numCols > 0 || plan->groupingSets)
+ if (plan->gsetid)
+ show_expression((Node *) plan->gsetid, "Filtered by",
+ (PlanState *) astate, ancestors, true, es);
+
+ if (plan->numCols > 0 || plan->rollup)
{
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(plan, ancestors);
- if (plan->groupingSets)
+ if (plan->rollup)
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
@@ -2314,7 +2318,7 @@ show_grouping_set_keys(PlanState *planstate,
Plan *plan = planstate->plan;
char *exprstr;
ListCell *lc;
- List *gsets = aggnode->groupingSets;
+ List *gsets = aggnode->rollup->gsets;
AttrNumber *keycols = aggnode->grpColIdx;
const char *keyname;
const char *keysetname;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 3533f5ccc8..4ed455d87d 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -815,7 +815,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
agg = (Agg *) (state->parent->plan);
- if (agg->groupingSets)
+ if (agg->rollup)
scratch.d.grouping_func.clauses = grp_node->cols;
else
scratch.d.grouping_func.clauses = NIL;
@@ -824,6 +824,14 @@ ExecInitExprRec(Expr *node, ExprState *state,
break;
}
+ case T_GroupingSetId:
+ {
+ scratch.opcode = EEOP_GROUPING_SET_ID;
+
+ ExprEvalPushStep(state, &scratch);
+ break;
+ }
+
case T_WindowFunc:
{
WindowFunc *wfunc = (WindowFunc *) node;
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index b0dbba4e55..b3537eb8d9 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -428,6 +428,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_XMLEXPR,
&&CASE_EEOP_AGGREF,
&&CASE_EEOP_GROUPING_FUNC,
+ &&CASE_EEOP_GROUPING_SET_ID,
&&CASE_EEOP_WINDOW_FUNC,
&&CASE_EEOP_SUBPLAN,
&&CASE_EEOP_ALTERNATIVE_SUBPLAN,
@@ -1512,6 +1513,16 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_GROUPING_SET_ID)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+
+ *op->resvalue = aggstate->phase->setno_gsetids[aggstate->current_set];
+ *op->resnull = false;
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_WINDOW_FUNC)
{
/*
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 8a8b49547b..01c2ac6b56 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -411,6 +411,7 @@ static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static bool agg_refill_hash_table(AggState *aggstate);
static void agg_sort_input(AggState *aggstate);
+static void agg_preprocess_groupingsets(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
static void hash_agg_check_limits(AggState *aggstate);
@@ -490,17 +491,26 @@ initialize_phase(AggState *aggstate, int newphase)
* Whatever the previous state, we're now done with whatever input
* tuplesort was in use, cleanup them.
*
- * Note: we keep the first tuplesort/tuplestore, this will benifit the
+ * Note: we keep the first tuplesort/tuplestore when it's not the
+ * final stage of partial groupingsets, this will benefit the
* rescan in some cases without resorting the input again.
*/
- if (!current_phase->is_hashed && aggstate->current_phase > 0)
+ if (!current_phase->is_hashed &&
+ (aggstate->current_phase > 0 || DO_AGGSPLIT_COMBINE(aggstate->aggsplit)))
{
persort = (AggStatePerPhaseSort) current_phase;
+
if (persort->sort_in)
{
tuplesort_end(persort->sort_in);
persort->sort_in = NULL;
}
+
+ if (persort->store_in)
+ {
+ tuplestore_end(persort->store_in);
+ persort->store_in = NULL;
+ }
}
/* advance to next phase */
@@ -569,6 +579,15 @@ fetch_input_tuple(AggState *aggstate)
return NULL;
slot = aggstate->sort_slot;
}
+ else if (current_phase->store_in)
+ {
+ /* make sure we check for interrupts in either path through here */
+ CHECK_FOR_INTERRUPTS();
+ if (!tuplestore_gettupleslot(current_phase->store_in, true, false,
+ aggstate->sort_slot))
+ return NULL;
+ slot = aggstate->sort_slot;
+ }
else
slot = ExecProcNode(outerPlanState(aggstate));
@@ -2133,6 +2152,9 @@ ExecAgg(PlanState *pstate)
CHECK_FOR_INTERRUPTS();
+ if (node->groupingsets_preprocess)
+ agg_preprocess_groupingsets(node);
+
if (!node->agg_done)
{
/* Dispatch based on strategy */
@@ -2173,7 +2195,7 @@ agg_retrieve_direct(AggState *aggstate)
TupleTableSlot *outerslot;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- bool hasGroupingSets = aggstate->phase->aggnode->groupingSets != NULL;
+ bool hasGroupingSets = aggstate->phase->aggnode->rollup != NULL;
int numGroupingSets = aggstate->phase->numsets;
int currentSet;
int nextSetSize;
@@ -2510,6 +2532,144 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+/*
+ * Routine for final phase of partial grouping sets:
+ *
+ * Preprocess tuples for the final phase of grouping sets. In the initial
+ * phase, each tuple is decorated with a grouping set ID; in the final
+ * phase, each grouping set is handled by its own aggregate phase, so we
+ * need to redirect each tuple to the proper aggregate phase according to
+ * its grouping set ID.
+ */
+static void
+agg_preprocess_groupingsets(AggState *aggstate)
+{
+ AggStatePerPhaseSort persort;
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase;
+ TupleTableSlot *outerslot;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ int phaseidx;
+
+ Assert(DO_AGGSPLIT_COMBINE(aggstate->aggsplit));
+ Assert(aggstate->groupingsets_preprocess);
+
+ /* Initialize tuple storage for each aggregate phase */
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
+ {
+ phase = aggstate->phases[phaseidx];
+
+ if (!phase->is_hashed)
+ {
+ persort = (AggStatePerPhaseSort) phase;
+ if (phase->aggnode->sortnode)
+ {
+ Sort *sortnode = (Sort *) phase->aggnode->sortnode;
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+
+ persort->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ }
+ else
+ {
+ persort->store_in = tuplestore_begin_heap(false, false, work_mem);
+ }
+ }
+ else
+ {
+ /*
+ * If it's AGG_HASHED, we don't need any storage to hold the
+ * tuples for later processing; we can do the transition
+ * immediately.
+ */
+ }
+ }
+
+ for (;;)
+ {
+ Datum ret;
+ bool isNull;
+ int setid;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
+ if (TupIsNull(outerslot))
+ break;
+
+ tmpcontext->ecxt_outertuple = outerslot;
+
+ /* Figure out which grouping set the tuple belongs to */
+ ret = ExecEvalExprSwitchContext(aggstate->gsetid, tmpcontext, &isNull);
+
+ setid = DatumGetInt32(ret);
+ phase = aggstate->phases[aggstate->gsetid_phaseidxs[setid]];
+
+ if (!phase->is_hashed)
+ {
+ persort = (AggStatePerPhaseSort) phase;
+
+ Assert(persort->sort_in || persort->store_in);
+
+ if (persort->sort_in)
+ tuplesort_puttupleslot(persort->sort_in, outerslot);
+ else if (persort->store_in)
+ tuplestore_puttupleslot(persort->store_in, outerslot);
+ }
+ else
+ {
+ perhash = (AggStatePerPhaseHash) phase;
+
+ /* If it is hashed, we can do the transition now. */
+ aggstate->current_phase = phase->phaseidx;
+ aggstate->phase = phase;
+ select_current_set(aggstate, 0, true);
+ hashagg_recompile_expressions(aggstate);
+
+ lookup_hash_entries(aggstate, perhash, NIL);
+ /* Do the transition */
+ advance_aggregates(aggstate);
+
+ /* Change current phase back to phase 0 */
+ aggstate->current_phase = 0;
+ aggstate->phase = aggstate->phases[0];
+ }
+
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ /* Sort the first phase if needed */
+ if (aggstate->aggstrategy != AGG_HASHED)
+ {
+ persort = (AggStatePerPhaseSort) aggstate->phase;
+
+ if (persort->sort_in)
+ tuplesort_performsort(persort->sort_in);
+ }
+ else
+ {
+ /*
+ * If we built hash tables, finalize any spills;
+ * AGG_MIXED will finalize its spills later.
+ */
+ hashagg_finish_initial_spills(aggstate);
+ }
+
+ /* Mark the hash tables as filled */
+ aggstate->table_filled = true;
+
+ /* Mark the input as sorted */
+ aggstate->input_sorted = true;
+
+ /* Clear the flag so we don't preprocess grouping sets again */
+ aggstate->groupingsets_preprocess = false;
+}
+
static void
agg_sort_input(AggState *aggstate)
{
@@ -3257,21 +3417,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->eflags = eflags;
aggstate->num_hashes = 0;
aggstate->hash_spill_mode = HASHNORMALMODE;
+ aggstate->groupingsets_preprocess = false;
/*
* Calculate the maximum number of grouping sets in any phase; this
* determines the size of some allocations.
*/
- if (node->groupingSets)
+ if (node->rollup)
{
- numGroupingSets = list_length(node->groupingSets);
+ numGroupingSets = list_length(node->rollup->gsets);
foreach(l, node->chain)
{
Agg *agg = lfirst(l);
numGroupingSets = Max(numGroupingSets,
- list_length(agg->groupingSets));
+ list_length(agg->rollup->gsets));
if (agg->aggstrategy != AGG_HASHED)
need_extra_slot = true;
@@ -3281,12 +3442,34 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->maxsets = numGroupingSets;
aggstate->numphases = 1 + list_length(node->chain);
+ /*
+ * If we are doing the final stage of partial grouping sets, preprocess
+ * the input tuples first, redirecting each tuple to its corresponding
+ * aggregate phase. See agg_preprocess_groupingsets().
+ */
+ if (node->rollup && DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ aggstate->groupingsets_preprocess = true;
+
+ /*
+ * Allocate the gsetid <-> phase mapping. In the final stage of
+ * partial grouping sets, each grouping set is extracted into its
+ * own phase, so the number of sets equals the number of phases.
+ */
+ aggstate->gsetid_phaseidxs =
+ (int *) palloc0(aggstate->numphases * sizeof(int));
+
+ if (aggstate->aggstrategy != AGG_HASHED)
+ need_extra_slot = true;
+ }
+
/*
* The first phase is not sorted, agg need to do its own sort. See
* agg_sort_input(), this can only happen in groupingsets case.
*/
if (node->sortnode)
- aggstate->input_sorted = false;
+ aggstate->input_sorted = false;
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
@@ -3395,6 +3578,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->ss.ps.qual =
ExecInitQual(node->plan.qual, (PlanState *) aggstate);
+ /*
+ * Initialize expression state to fetch grouping set id from
+ * the partial groupingsets aggregate result.
+ */
+ aggstate->gsetid =
+ ExecInitExpr(node->gsetid, (PlanState *)aggstate);
/*
* We should now have found all Aggrefs in the targetlist and quals.
*/
@@ -3443,6 +3632,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
all_grouped_cols = bms_add_members(all_grouped_cols, cols);
+ /*
+ * In the initial stage of partial grouping sets, the targetlist
+ * carries an extra grouping set ID; fill the setno <-> gsetid
+ * map so EEOP_GROUPING_SET_ID can evaluate the correct gsetid
+ * for the output.
+ */
+ if (aggnode->rollup &&
+ DO_AGGSPLIT_SERIALIZE(aggnode->aggsplit))
+ {
+ GroupingSetData *gs;
+ phasedata->setno_gsetids = palloc(sizeof(int));
+ gs = linitial_node(GroupingSetData,
+ aggnode->rollup->gsets_data);
+ phasedata->setno_gsetids[0] = gs->setId;
+ }
+
/*
* Initialize pergroup state. For AGG_HASHED, all groups do transition
* on the fly, all pergroup states are kept in hashtable, everytime
@@ -3461,8 +3666,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* we can do the transition immediately when a tuple is fetched,
* which means we can do the transition concurrently with the
* first phase.
+ *
+ * Note: this doesn't work for the final phase of partial grouping
+ * sets, in which each partial input tuple has a specific target
+ * aggregate phase.
*/
- if (phaseidx > 0)
+ if (phaseidx > 0 && !aggstate->groupingsets_preprocess)
{
aggstate->phases[0]->concurrent_hashes =
lappend(aggstate->phases[0]->concurrent_hashes, perhash);
@@ -3480,17 +3689,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->aggnode = aggnode;
phasedata->aggstrategy = aggnode->aggstrategy;
- if (aggnode->groupingSets)
+ if (aggnode->rollup)
{
- phasedata->numsets = list_length(aggnode->groupingSets);
+ phasedata->numsets = list_length(aggnode->rollup->gsets_data);
phasedata->gset_lengths = palloc(phasedata->numsets * sizeof(int));
phasedata->grouped_cols = palloc(phasedata->numsets * sizeof(Bitmapset *));
+ phasedata->setno_gsetids = palloc(phasedata->numsets * sizeof(int));
i = 0;
- foreach(l, aggnode->groupingSets)
+ foreach(l, aggnode->rollup->gsets_data)
{
- int current_length = list_length(lfirst(l));
- Bitmapset *cols = NULL;
+ GroupingSetData *gs = lfirst_node(GroupingSetData, l);
+ int current_length = list_length(gs->set);
+ Bitmapset *cols = NULL;
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -3499,6 +3710,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->grouped_cols[i] = cols;
phasedata->gset_lengths[i] = current_length;
+ /*
+ * In the initial stage of partial grouping sets, the targetlist
+ * carries an extra grouping set ID; fill the setno <-> gsetid
+ * map so EEOP_GROUPING_SET_ID can evaluate the correct gsetid
+ * for the output.
+ */
+ if (DO_AGGSPLIT_SERIALIZE(aggstate->aggsplit))
+ phasedata->setno_gsetids[i] = gs->setId;
+
++i;
}
@@ -3575,8 +3795,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* For non-first AGG_SORTED phase, it processes the same input
* tuples with previous phase except that it need to resort the
* input tuples. Tell the previous phase to copy out the tuples.
+ *
+ * Note: this doesn't work for the final stage of partial grouping
+ * sets, in which each tuple has a specific target aggregate phase.
*/
- if (phaseidx > 0)
+ if (phaseidx > 0 && !aggstate->groupingsets_preprocess)
{
AggStatePerPhaseSort prev =
(AggStatePerPhaseSort) aggstate->phases[phaseidx - 1];
@@ -3587,6 +3810,18 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
}
+ /*
+ * Fill the gsetid_phaseidxs array so we can find the corresponding
+ * phase for a given gsetid.
+ */
+ if (aggstate->groupingsets_preprocess)
+ {
+ GroupingSetData *gs =
+ linitial_node(GroupingSetData, aggnode->rollup->gsets_data);
+
+ aggstate->gsetid_phaseidxs[gs->setId] = phaseidx;
+ }
+
phasedata->phaseidx = phaseidx;
aggstate->phases[phaseidx] = phasedata;
}
@@ -4500,6 +4735,8 @@ ExecEndAgg(AggState *node)
persort = (AggStatePerPhaseSort) phase;
if (persort->sort_in)
tuplesort_end(persort->sort_in);
+ if (persort->store_in)
+ tuplestore_end(persort->store_in);
}
hashagg_reset_spill_state(node);
@@ -4700,6 +4937,13 @@ ExecReScanAgg(AggState *node)
}
}
+ /*
+ * If the agg is doing the final stage of partial grouping sets, reset
+ * the flag so the grouping sets preprocessing is done again.
+ */
+ if (aggnode->rollup && DO_AGGSPLIT_COMBINE(node->aggsplit))
+ node->groupingsets_preprocess = true;
+
/* reset to phase 0 */
initialize_phase(node, 0);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 066cd59554..f70eaabd0c 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -1882,6 +1882,46 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_GROUPING_SET_ID:
+ {
+ LLVMValueRef v_resvalue;
+ LLVMValueRef v_aggstatep;
+ LLVMValueRef v_phase;
+ LLVMValueRef v_current_set;
+ LLVMValueRef v_setno_gsetids;
+
+ v_aggstatep =
+ LLVMBuildBitCast(b, v_parent, l_ptr(StructAggState), "");
+
+ /*
+ * op->resvalue =
+ * aggstate->phase->setno_gsetids
+ * [aggstate->current_set]
+ */
+ v_phase =
+ l_load_struct_gep(b, v_aggstatep,
+ FIELDNO_AGGSTATE_PHASE,
+ "aggstate.phase");
+ v_setno_gsetids =
+ l_load_struct_gep(b, v_phase,
+ FIELDNO_AGGSTATEPERPHASE_SETNOGSETIDS,
+ "aggstateperphase.setno_gsetids");
+ v_current_set =
+ l_load_struct_gep(b, v_aggstatep,
+ FIELDNO_AGGSTATE_CURRENT_SET,
+ "aggstate.current_set");
+ v_resvalue =
+ l_load_gep1(b, v_setno_gsetids, v_current_set, "");
+ v_resvalue =
+ LLVMBuildZExt(b, v_resvalue, TypeSizeT, "");
+
+ LLVMBuildStore(b, v_resvalue, v_resvaluep);
+ LLVMBuildStore(b, l_sbool_const(0), v_resnullp);
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
case EEOP_WINDOW_FUNC:
{
WindowFuncExprState *wfunc = op->d.window_func.wfstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 04b4c65858..de4dcfe165 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -990,8 +990,9 @@ _copyAgg(const Agg *from)
COPY_SCALAR_FIELD(numGroups);
COPY_SCALAR_FIELD(transitionSpace);
COPY_BITMAPSET_FIELD(aggParams);
- COPY_NODE_FIELD(groupingSets);
+ COPY_NODE_FIELD(rollup);
COPY_NODE_FIELD(chain);
+ COPY_NODE_FIELD(gsetid);
COPY_NODE_FIELD(sortnode);
return newnode;
@@ -1478,6 +1479,50 @@ _copyGroupingFunc(const GroupingFunc *from)
return newnode;
}
+/*
+ * _copyGroupingSetId
+ */
+static GroupingSetId *
+_copyGroupingSetId(const GroupingSetId *from)
+{
+ GroupingSetId *newnode = makeNode(GroupingSetId);
+
+ return newnode;
+}
+
+/*
+ * _copyRollupData
+ */
+static RollupData *
+_copyRollupData(const RollupData *from)
+{
+ RollupData *newnode = makeNode(RollupData);
+
+ COPY_NODE_FIELD(groupClause);
+ COPY_NODE_FIELD(gsets);
+ COPY_NODE_FIELD(gsets_data);
+ COPY_SCALAR_FIELD(numGroups);
+ COPY_SCALAR_FIELD(hashable);
+ COPY_SCALAR_FIELD(is_hashed);
+
+ return newnode;
+}
+
+/*
+ * _copyGroupingSetData
+ */
+static GroupingSetData *
+_copyGroupingSetData(const GroupingSetData *from)
+{
+ GroupingSetData *newnode = makeNode(GroupingSetData);
+
+ COPY_NODE_FIELD(set);
+ COPY_SCALAR_FIELD(setId);
+ COPY_SCALAR_FIELD(numGroups);
+
+ return newnode;
+}
+
/*
* _copyWindowFunc
*/
@@ -4972,6 +5017,9 @@ copyObjectImpl(const void *from)
case T_GroupingFunc:
retval = _copyGroupingFunc(from);
break;
+ case T_GroupingSetId:
+ retval = _copyGroupingSetId(from);
+ break;
case T_WindowFunc:
retval = _copyWindowFunc(from);
break;
@@ -5608,6 +5656,12 @@ copyObjectImpl(const void *from)
case T_SortGroupClause:
retval = _copySortGroupClause(from);
break;
+ case T_RollupData:
+ retval = _copyRollupData(from);
+ break;
+ case T_GroupingSetData:
+ retval = _copyGroupingSetData(from);
+ break;
case T_GroupingSet:
retval = _copyGroupingSet(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 88b912977e..6aa71d3723 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3078,6 +3078,9 @@ equal(const void *a, const void *b)
case T_GroupingFunc:
retval = _equalGroupingFunc(a, b);
break;
+ case T_GroupingSetId:
+ retval = true;
+ break;
case T_WindowFunc:
retval = _equalWindowFunc(a, b);
break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index d85ca9f7c5..877ea0bc16 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -62,6 +62,9 @@ exprType(const Node *expr)
case T_GroupingFunc:
type = INT4OID;
break;
+ case T_GroupingSetId:
+ type = INT4OID;
+ break;
case T_WindowFunc:
type = ((const WindowFunc *) expr)->wintype;
break;
@@ -740,6 +743,9 @@ exprCollation(const Node *expr)
case T_GroupingFunc:
coll = InvalidOid;
break;
+ case T_GroupingSetId:
+ coll = InvalidOid;
+ break;
case T_WindowFunc:
coll = ((const WindowFunc *) expr)->wincollid;
break;
@@ -1869,6 +1875,7 @@ expression_tree_walker(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
/* primitive node types with no expression subnodes */
break;
case T_WithCheckOption:
@@ -2575,6 +2582,7 @@ expression_tree_mutator(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
return (Node *) copyObject(node);
case T_WithCheckOption:
{
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5816d122c1..efcb1c7d4f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -785,8 +785,9 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_LONG_FIELD(numGroups);
WRITE_UINT64_FIELD(transitionSpace);
WRITE_BITMAPSET_FIELD(aggParams);
- WRITE_NODE_FIELD(groupingSets);
+ WRITE_NODE_FIELD(rollup);
WRITE_NODE_FIELD(chain);
+ WRITE_NODE_FIELD(gsetid);
WRITE_NODE_FIELD(sortnode);
}
@@ -1150,6 +1151,13 @@ _outGroupingFunc(StringInfo str, const GroupingFunc *node)
WRITE_LOCATION_FIELD(location);
}
+static void
+_outGroupingSetId(StringInfo str,
+ const GroupingSetId *node pg_attribute_unused())
+{
+ WRITE_NODE_TYPE("GROUPINGSETID");
+}
+
static void
_outWindowFunc(StringInfo str, const WindowFunc *node)
{
@@ -2002,6 +2010,7 @@ _outGroupingSetData(StringInfo str, const GroupingSetData *node)
WRITE_NODE_TYPE("GSDATA");
WRITE_NODE_FIELD(set);
+ WRITE_INT_FIELD(setId);
WRITE_FLOAT_FIELD(numGroups, "%.0f");
}
@@ -3847,6 +3856,9 @@ outNode(StringInfo str, const void *obj)
case T_GroupingFunc:
_outGroupingFunc(str, obj);
break;
+ case T_GroupingSetId:
+ _outGroupingSetId(str, obj);
+ break;
case T_WindowFunc:
_outWindowFunc(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index af4fcfe1ee..c9a3340f58 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -636,6 +636,50 @@ _readGroupingFunc(void)
READ_DONE();
}
+/*
+ * _readGroupingSetId
+ */
+static GroupingSetId *
+_readGroupingSetId(void)
+{
+ READ_LOCALS_NO_FIELDS(GroupingSetId);
+
+ READ_DONE();
+}
+
+/*
+ * _readRollupData
+ */
+static RollupData *
+_readRollupData(void)
+{
+ READ_LOCALS(RollupData);
+
+ READ_NODE_FIELD(groupClause);
+ READ_NODE_FIELD(gsets);
+ READ_NODE_FIELD(gsets_data);
+ READ_FLOAT_FIELD(numGroups);
+ READ_BOOL_FIELD(hashable);
+ READ_BOOL_FIELD(is_hashed);
+
+ READ_DONE();
+}
+
+/*
+ * _readGroupingSetData
+ */
+static GroupingSetData *
+_readGroupingSetData(void)
+{
+ READ_LOCALS(GroupingSetData);
+
+ READ_NODE_FIELD(set);
+ READ_INT_FIELD(setId);
+ READ_FLOAT_FIELD(numGroups);
+
+ READ_DONE();
+}
+
/*
* _readWindowFunc
*/
@@ -2205,8 +2249,9 @@ _readAgg(void)
READ_LONG_FIELD(numGroups);
READ_UINT64_FIELD(transitionSpace);
READ_BITMAPSET_FIELD(aggParams);
- READ_NODE_FIELD(groupingSets);
+ READ_NODE_FIELD(rollup);
READ_NODE_FIELD(chain);
+ READ_NODE_FIELD(gsetid);
READ_NODE_FIELD(sortnode);
READ_DONE();
@@ -2642,6 +2687,12 @@ parseNodeString(void)
return_value = _readAggref();
else if (MATCH("GROUPINGFUNC", 12))
return_value = _readGroupingFunc();
+ else if (MATCH("GROUPINGSETID", 13))
+ return_value = _readGroupingSetId();
+ else if (MATCH("ROLLUP", 6))
+ return_value = _readRollupData();
+ else if (MATCH("GSDATA", 6))
+ return_value = _readGroupingSetData();
else if (MATCH("WINDOWFUNC", 10))
return_value = _readWindowFunc();
else if (MATCH("SUBSCRIPTINGREF", 15))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe77d8..e6c7f080e0 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2710,8 +2710,11 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
/*
* For each useful ordering, we can consider an order-preserving Gather
- * Merge.
+ * Merge. Don't do this for partial grouping sets.
*/
+ if (root->parse->groupingSets)
+ return;
+
foreach(lc, rel->partial_pathlist)
{
Path *subpath = (Path *) lfirst(lc);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e9ad5a98cb..db8822261e 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1641,7 +1641,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
groupColIdx,
groupOperators,
groupCollations,
- NIL,
+ NULL,
NIL,
best_path->path.rows,
0,
@@ -2095,7 +2095,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
extract_grouping_ops(best_path->groupClause),
extract_grouping_collations(best_path->groupClause,
subplan->targetlist),
- NIL,
+ NULL,
NIL,
best_path->numGroups,
best_path->transitionSpace,
@@ -2214,7 +2214,6 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
* never be grouping in an UPDATE/DELETE; but let's Assert that.
*/
Assert(root->inhTargetKind == INHKIND_NONE);
- Assert(root->grouping_map == NULL);
root->grouping_map = grouping_map;
/*
@@ -2241,7 +2240,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
* node if the input is not sorted yet, for other rollups using
* sorted mode, always add an explicit sort.
*/
- if (!rollup->is_hashed)
+ /* In the final stage, the rollup may contain an empty set here */
+ if (!rollup->is_hashed &&
+ list_length(linitial(rollup->gsets)) != 0)
{
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
@@ -2265,12 +2266,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
NIL,
rollup->numGroups,
best_path->transitionSpace,
@@ -2282,8 +2283,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
}
/*
* Now make the real Agg node
*/
{
RollupData *rollup = linitial(rollups);
AttrNumber *top_grpColIdx;
@@ -2315,12 +2315,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
chain,
rollup->numGroups,
best_path->transitionSpace,
@@ -6222,7 +6222,7 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ RollupData *rollup, List *chain, double dNumGroups,
Size transitionSpace, Plan *sortnode, Plan *lefttree)
{
Agg *node = makeNode(Agg);
@@ -6241,8 +6241,9 @@ make_agg(List *tlist, List *qual,
node->numGroups = numGroups;
node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
- node->groupingSets = groupingSets;
+ node->rollup = rollup;
node->chain = chain;
+ node->gsetid = NULL;
node->sortnode = sortnode;
plan->qual = qual;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f26e962ac9..0a721c9e9b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -113,6 +113,7 @@ typedef struct
Bitmapset *unhashable_refs;
List *unsortable_sets;
int *tleref_to_colnum_map;
+ int num_sets;
} grouping_sets_data;
/*
@@ -126,6 +127,8 @@ typedef struct
* clauses per Window */
} WindowClauseSortData;
+typedef void (*AddPathCallback) (RelOptInfo *parent_rel, Path *new_path);
+
/* Local functions */
static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
@@ -142,7 +145,8 @@ static double preprocess_limit(PlannerInfo *root,
static void remove_useless_groupby_columns(PlannerInfo *root);
static List *preprocess_groupclause(PlannerInfo *root, List *force);
static List *extract_rollup_sets(List *groupingSets);
-static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
+static List *reorder_grouping_sets(grouping_sets_data *gd,
+ List *groupingSets, List *sortclause);
static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
@@ -176,7 +180,11 @@ static void consider_groupingsets_paths(PlannerInfo *root,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
double dNumGroups,
- AggStrategy strat);
+ List *havingQual,
+ AggStrategy strat,
+ AggSplit aggsplit,
+ AddPathCallback add_path_fn);
+
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -250,6 +258,9 @@ static bool group_by_has_partkey(RelOptInfo *input_rel,
List *groupClause);
static int common_prefix_cmp(const void *a, const void *b);
+static List *extract_final_rollups(PlannerInfo *root,
+ grouping_sets_data *gd,
+ List *rollups);
/*****************************************************************************
*
@@ -2494,6 +2505,7 @@ preprocess_grouping_sets(PlannerInfo *root)
GroupingSetData *gs = makeNode(GroupingSetData);
gs->set = gset;
+ gs->setId = gd->num_sets++;
gd->unsortable_sets = lappend(gd->unsortable_sets, gs);
/*
@@ -2538,7 +2550,7 @@ preprocess_grouping_sets(PlannerInfo *root)
* largest-member-first, and applies the GroupingSetData annotations,
* though the data will be filled in later.
*/
- current_sets = reorder_grouping_sets(current_sets,
+ current_sets = reorder_grouping_sets(gd, current_sets,
(list_length(sets) == 1
? parse->sortClause
: NIL));
@@ -3547,7 +3559,7 @@ extract_rollup_sets(List *groupingSets)
* gets implemented in one pass.)
*/
static List *
-reorder_grouping_sets(List *groupingsets, List *sortclause)
+reorder_grouping_sets(grouping_sets_data *gd, List *groupingsets, List *sortclause)
{
ListCell *lc;
List *previous = NIL;
@@ -3581,6 +3593,7 @@ reorder_grouping_sets(List *groupingsets, List *sortclause)
previous = list_concat(previous, new_elems);
gs->set = list_copy(previous);
+ gs->setId = gd->num_sets++;
result = lcons(gs, result);
}
@@ -4214,9 +4227,11 @@ consider_groupingsets_paths(PlannerInfo *root,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
double dNumGroups,
- AggStrategy strat)
+ List *havingQual,
+ AggStrategy strat,
+ AggSplit aggsplit,
+ AddPathCallback add_path_fn)
{
- Query *parse = root->parse;
Assert(strat == AGG_HASHED || strat == AGG_SORTED);
/*
@@ -4381,16 +4396,20 @@ consider_groupingsets_paths(PlannerInfo *root,
strat = AGG_MIXED;
}
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- strat,
- new_rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ new_rollups = extract_final_rollups(root, gd, new_rollups);
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ strat,
+ new_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
return;
}
@@ -4402,7 +4421,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* Callers consider AGG_SORTED strategy, the first rollup must
- * use non-hashed aggregate, 'is_sorted' tells whether the first
+ * use non-hashed aggregate, is_sorted tells whether the first
* rollup need to do its own sort.
*
* we try and make two paths: one sorted and one mixed
@@ -4547,16 +4566,20 @@ consider_groupingsets_paths(PlannerInfo *root,
if (rollups)
{
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_MIXED,
- rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rollups = extract_final_rollups(root, gd, rollups);
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_MIXED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
}
}
@@ -4564,16 +4587,82 @@ consider_groupingsets_paths(PlannerInfo *root,
* Now try the simple sorted case.
*/
if (!gd->unsortable_sets)
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_SORTED,
- gd->rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ {
+ List *rollups;
+
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rollups = extract_final_rollups(root, gd, gd->rollups);
+ else
+ rollups = gd->rollups;
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_SORTED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
+ }
+}
+
+/*
+ * When combining the partial grouping sets aggregation, the input is a
+ * mix of tuples from different grouping sets; the executor dispatches
+ * the tuples to different rollups (phases) according to the grouping
+ * set ID.
+ *
+ * We cannot reuse the rollups from the initial stage, in which each
+ * tuple is processed by one or more grouping sets within one rollup,
+ * because in the combining stage each tuple belongs to exactly one
+ * grouping set. Instead, we use final rollups, in which each rollup
+ * has only one grouping set.
+ */
+static List *
+extract_final_rollups(PlannerInfo *root,
+ grouping_sets_data *gd,
+ List *rollups)
+{
+ ListCell *lc;
+ List *new_rollups = NIL;
+
+ foreach(lc, rollups)
+ {
+ ListCell *lc1;
+ RollupData *rollup = lfirst_node(RollupData, lc);
+
+ foreach(lc1, rollup->gsets_data)
+ {
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc1);
+ RollupData *new_rollup = makeNode(RollupData);
+
+ if (gs->set != NIL)
+ {
+ new_rollup->groupClause = preprocess_groupclause(root, gs->set);
+ new_rollup->gsets_data = list_make1(gs);
+ new_rollup->gsets = remap_to_groupclause_idx(new_rollup->groupClause,
+ new_rollup->gsets_data,
+ gd->tleref_to_colnum_map);
+ new_rollup->hashable = rollup->hashable;
+ new_rollup->is_hashed = rollup->is_hashed;
+ }
+ else
+ {
+ new_rollup->groupClause = NIL;
+ new_rollup->gsets_data = list_make1(gs);
+ new_rollup->gsets = list_make1(NIL);
+ new_rollup->hashable = false;
+ new_rollup->is_hashed = false;
+ }
+
+ new_rollup->numGroups = gs->numGroups;
+ new_rollups = lappend(new_rollups, new_rollup);
+ }
+ }
+
+ return new_rollups;
}
/*
@@ -5283,6 +5372,17 @@ make_partial_grouping_target(PlannerInfo *root,
add_new_columns_to_pathtarget(partial_target, non_group_exprs);
+ /*
+ * When generating a partial grouping-sets path, add an expression that
+ * exposes the grouping set ID of each tuple, so that in the final stage
+ * the executor knows which set a tuple belongs to and can combine it
+ * accordingly.
+ */
+ if (parse->groupingSets)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ add_new_column_to_pathtarget(partial_target, (Expr *)expr);
+ }
+
/*
* Adjust Aggrefs to put them in partial mode. At this point all Aggrefs
* are at the top level of the target list, so we can just scan the list
@@ -6458,7 +6558,10 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
consider_groupingsets_paths(root, grouped_rel,
path, is_sorted, can_hash,
gd, agg_costs, dNumGroups,
- AGG_SORTED);
+ havingQual,
+ AGG_SORTED,
+ AGGSPLIT_SIMPLE,
+ add_path);
continue;
}
@@ -6519,15 +6622,37 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
foreach(lc, partially_grouped_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
+ bool is_sorted;
+
+ is_sorted = pathkeys_contained_in(root->group_pathkeys,
+ path->pathkeys);
+
+ /*
+ * Use any available suitably-sorted path as input, and also
+ * consider sorting the cheapest-total path.
+ */
+ if (path != partially_grouped_rel->cheapest_total_path &&
+ !is_sorted)
+ continue;
+
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGG_SORTED,
+ AGGSPLIT_FINAL_DESERIAL,
+ add_path);
+ continue;
+ }
/*
* Insert a Sort node, if required. But there's no point in
* sorting anything but the cheapest path.
*/
- if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+ if (!is_sorted)
{
- if (path != partially_grouped_rel->cheapest_total_path)
- continue;
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -6571,7 +6696,10 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
gd, agg_costs, dNumGroups,
- AGG_HASHED);
+ havingQual,
+ AGG_HASHED,
+ AGGSPLIT_SIMPLE,
+ add_path);
}
else
{
@@ -6615,23 +6743,39 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = partially_grouped_rel->cheapest_total_path;
- hashaggtablesize = estimate_hashagg_tablesize(path,
- agg_final_costs,
- dNumGroups);
+ if (parse->groupingSets)
+ {
+ /*
+ * Try for a hash-only groupingsets path over unsorted input.
+ */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ add_path);
+ }
+ else
+ {
+ hashaggtablesize = estimate_hashagg_tablesize(path,
+ agg_final_costs,
+ dNumGroups);
- if (enable_hashagg_disk ||
- hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ if (enable_hashagg_disk ||
+ hashaggtablesize < work_mem * 1024L)
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6841,6 +6985,19 @@ create_partial_grouping_paths(PlannerInfo *root,
path->pathkeys);
if (path == cheapest_partial_path || is_sorted)
{
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGG_SORTED,
+ AGGSPLIT_INITIAL_SERIAL,
+ add_partial_path);
+ continue;
+ }
+
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
path = (Path *) create_sort_path(root,
@@ -6910,26 +7067,41 @@ create_partial_grouping_paths(PlannerInfo *root,
{
double hashaggtablesize;
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_partial_path,
- agg_partial_costs,
- dNumPartialPartialGroups);
-
- /* Do the same for partial paths. */
- if ((enable_hashagg_disk || hashaggtablesize < work_mem * 1024L) &&
- cheapest_partial_path != NULL)
+ if (parse->groupingSets)
{
- add_partial_path(partially_grouped_rel, (Path *)
- create_agg_path(root,
- partially_grouped_rel,
- cheapest_partial_path,
- partially_grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_INITIAL_SERIAL,
- parse->groupClause,
- NIL,
- agg_partial_costs,
- dNumPartialPartialGroups));
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ cheapest_partial_path,
+ false, true,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ add_partial_path);
+ }
+ else
+ {
+ hashaggtablesize =
+ estimate_hashagg_tablesize(cheapest_partial_path,
+ agg_partial_costs,
+ dNumPartialPartialGroups);
+
+ /* Do the same for partial paths. */
+ if ((enable_hashagg_disk || hashaggtablesize < work_mem * 1024L) &&
+ cheapest_partial_path != NULL)
+ {
+ add_partial_path(partially_grouped_rel, (Path *)
+ create_agg_path(root,
+ partially_grouped_rel,
+ cheapest_partial_path,
+ partially_grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ parse->groupClause,
+ NIL,
+ agg_partial_costs,
+ dNumPartialPartialGroups));
+ }
}
}
@@ -6973,6 +7145,9 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
generate_gather_paths(root, rel, true);
/* Try cheapest partial path + explicit Sort + Gather Merge. */
+ if (root->parse->groupingSets)
+ return;
+
cheapest_partial_path = linitial(rel->partial_pathlist);
if (!pathkeys_contained_in(root->group_pathkeys,
cheapest_partial_path->pathkeys))
@@ -7017,11 +7192,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded506b..eae7d15701 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -754,6 +754,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
plan->qual = (List *)
convert_combining_aggrefs((Node *) plan->qual,
NULL);
+
+ /*
+ * For grouping sets, we must add an expression to evaluate the
+ * grouping set ID and fix its reference against the targetlist of
+ * the child plan node.
+ */
+ if (agg->rollup)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ indexed_tlist *subplan_itlist = build_tlist_index(plan->lefttree->targetlist);
+
+ agg->gsetid = (Expr *) fix_upper_expr(root, (Node *)expr,
+ subplan_itlist,
+ OUTER_VAR,
+ rtoffset);
+ }
}
set_upper_references(root, plan, rtoffset);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 4578c3184b..f0f7cd57a5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2995,6 +2995,7 @@ create_groupingsets_path(PlannerInfo *root,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups,
+ AggSplit aggsplit,
bool is_sorted)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
@@ -3012,6 +3013,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->aggsplit = aggsplit;
pathnode->is_sorted = is_sorted;
/*
@@ -3046,11 +3048,27 @@ create_groupingsets_path(PlannerInfo *root,
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
Assert(aggstrategy != AGG_MIXED || list_length(rollups) > 1);
+ /*
+ * Estimate the cost of the grouping sets.
+ *
+ * If we are finalizing grouping sets, subpath->rows counts rows from
+ * all sets, so we must estimate the number of rows feeding each
+ * rollup.  The cost of preprocessing the grouping sets is not charged
+ * here: the expression that redirects tuples is a simple Var, which
+ * normally costs nothing.
+ */
foreach(lc, rollups)
{
RollupData *rollup = lfirst(lc);
List *gsets = rollup->gsets;
int numGroupCols = list_length(linitial(gsets));
+ double rows = 0.0;
+
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rows = rollup->numGroups * subpath->rows / numGroups;
+ else
+ rows = subpath->rows;
/*
* In AGG_SORTED or AGG_PLAIN mode, the first rollup do its own
@@ -3072,7 +3090,7 @@ create_groupingsets_path(PlannerInfo *root,
cost_sort(&sort_path, root, NIL,
input_total_cost,
- subpath->rows,
+ rows,
subpath->pathtarget->width,
0.0,
work_mem,
@@ -3090,7 +3108,7 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
input_startup_cost,
input_total_cost,
- subpath->rows,
+ rows,
subpath->pathtarget->width);
is_first = false;
}
@@ -3102,7 +3120,6 @@ create_groupingsets_path(PlannerInfo *root,
sort_path.startup_cost = 0;
sort_path.total_cost = 0;
- sort_path.rows = subpath->rows;
rollup_strategy = rollup->is_hashed ?
AGG_HASHED : (numGroupCols ? AGG_SORTED : AGG_PLAIN);
@@ -3112,7 +3129,7 @@ create_groupingsets_path(PlannerInfo *root,
/* Account for cost of sort, but don't charge input cost again */
cost_sort(&sort_path, root, NIL,
0.0,
- subpath->rows,
+ rows,
subpath->pathtarget->width,
0.0,
work_mem,
@@ -3128,7 +3145,7 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows,
+ rows,
subpath->pathtarget->width);
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 5e63238f03..5779d158ba 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -7941,6 +7941,12 @@ get_rule_expr(Node *node, deparse_context *context,
}
break;
+ case T_GroupingSetId:
+ {
+ appendStringInfoString(buf, "GROUPINGSETID()");
+ }
+ break;
+
case T_WindowFunc:
get_windowfunc_expr((WindowFunc *) node, context);
break;
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 4ed5d0a7de..4d36c2d77b 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -216,6 +216,7 @@ typedef enum ExprEvalOp
EEOP_XMLEXPR,
EEOP_AGGREF,
EEOP_GROUPING_FUNC,
+ EEOP_GROUPING_SET_ID,
EEOP_WINDOW_FUNC,
EEOP_SUBPLAN,
EEOP_ALTERNATIVE_SUBPLAN,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 1612b71e05..67b728ae73 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -285,6 +285,8 @@ typedef struct AggStatePerPhaseData
AggStatePerGroup *pergroups; /* pergroup states for a phase */
bool skip_evaltrans; /* do not build evaltrans */
+#define FIELDNO_AGGSTATEPERPHASE_SETNOGSETIDS 12
+ int *setno_gsetids; /* setno <-> gsetid map */
} AggStatePerPhaseData;
typedef struct AggStatePerPhaseSortData
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 788ddade64..4f591bf1ca 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2047,6 +2047,7 @@ typedef struct AggState
int numtrans; /* number of pertrans items */
AggStrategy aggstrategy; /* strategy mode */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
+#define FIELDNO_AGGSTATE_PHASE 6
AggStatePerPhase phase; /* pointer to current phase data */
int numphases; /* number of phases (including phase 0) */
int current_phase; /* current phase number */
@@ -2070,8 +2071,6 @@ typedef struct AggState
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
AggStatePerPhase *phases; /* array of all phases */
- Tuplesortstate *sort_in; /* sorted input to phases > 1 */
- Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
@@ -2106,6 +2105,11 @@ typedef struct AggState
ProjectionInfo *combinedproj; /* projection machinery */
+
+ /* these fields are used in parallel grouping sets */
+ bool groupingsets_preprocess; /* groupingsets preprocessed yet? */
+ ExprState *gsetid; /* expression state to get grpsetid from input */
+ int *gsetid_phaseidxs; /* grpsetid <-> phaseidx mapping */
} AggState;
/* ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe8cc..a48a7af0e3 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -153,6 +153,7 @@ typedef enum NodeTag
T_Param,
T_Aggref,
T_GroupingFunc,
+ T_GroupingSetId,
T_WindowFunc,
T_SubscriptingRef,
T_FuncExpr,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c1e69c808f..2761fa6d01 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1676,6 +1676,7 @@ typedef struct GroupingSetData
{
NodeTag type;
List *set; /* grouping set as list of sortgrouprefs */
+ int setId; /* unique grouping set identifier */
double numGroups; /* est. number of result groups */
} GroupingSetData;
@@ -1702,6 +1703,7 @@ typedef struct GroupingSetsPath
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
uint64 transitionSpace; /* for pass-by-ref transition data */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
bool is_sorted; /* input sorted in groupcols of first rollup */
} GroupingSetsPath;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 3cd2537e9e..5b1239adf2 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -20,6 +20,7 @@
#include "nodes/bitmapset.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
+#include "nodes/pathnodes.h"
/* ----------------------------------------------------------------
@@ -816,8 +817,9 @@ typedef struct Agg
uint64 transitionSpace; /* for pass-by-ref transition data */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
- List *groupingSets; /* grouping sets to use */
+ RollupData *rollup; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Expr *gsetid; /* expression to fetch grouping set id */
Plan *sortnode; /* agg does its own sort, only used by grouping sets now */
} Agg;
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index d73be2ad46..f8f85d431a 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -364,6 +364,12 @@ typedef struct GroupingFunc
int location; /* token location */
} GroupingFunc;
+/* GroupingSetId */
+typedef struct GroupingSetId
+{
+ Expr xpr;
+} GroupingSetId;
+
/*
* WindowFunc
*/
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index f9f388ba06..4fde8b22bf 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -218,6 +218,7 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups,
+ AggSplit aggsplit,
bool is_sorted);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5954ff3997..e987011328 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,7 +54,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ RollupData *rollup, List *chain, double dNumGroups,
Size transitionSpace, Plan *sortnode, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 09f881d78a..8c01e0394b 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1744,4 +1744,116 @@ drop table gs_hash_1;
drop table gs_hash_2;
drop table gs_hash_3;
SET enable_groupingsets_hash_disk TO DEFAULT;
+--
+-- Compare results between parallel plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+create table gstest_p as select g%100 as g100, g%10 as g10, g
+from generate_series(0,199999) g;
+ANALYZE gstest_p;
+-- Prepared sort agg without parallelism
+set enable_hashagg = off;
+set min_parallel_table_scan_size = '128MB';
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+ QUERY PLAN
+----------------------------
+ GroupAggregate
+ Sort Key: g100, g10
+ Group Key: g100, g10
+ Group Key: g100
+ Group Key: ()
+ Sort Key: g10
+ Group Key: g10
+ -> Seq Scan on gstest_p
+(8 rows)
+
+create table p_gs_group_1 as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+-- Prepare sort agg with parallelism
+set min_parallel_table_scan_size = '4kB';
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+ QUERY PLAN
+-------------------------------------------------
+ Finalize GroupAggregate
+ Filtered by: (GROUPINGSETID())
+ Sort Key: g100, g10
+ Group Key: g100, g10
+ Sort Key: g100
+ Group Key: g100
+ Group Key: ()
+ Sort Key: g10
+ Group Key: g10
+ -> Gather
+ Workers Planned: 2
+ -> Partial GroupAggregate
+ Sort Key: g100, g10
+ Group Key: g100, g10
+ Group Key: g100
+ Group Key: ()
+ Sort Key: g10
+ Group Key: g10
+ -> Parallel Seq Scan on gstest_p
+(19 rows)
+
+create table p_gs_group_1_p as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+-- Prepare hash agg with parallelism
+SET enable_groupingsets_hash_disk = true;
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+ QUERY PLAN
+-------------------------------------------------
+ Finalize MixedAggregate
+ Filtered by: (GROUPINGSETID())
+ Group Key: ()
+ Hash Key: g100, g10
+ Hash Key: g100
+ Hash Key: g10
+ -> Gather
+ Workers Planned: 2
+ -> Partial MixedAggregate
+ Group Key: ()
+ Hash Key: g100, g10
+ Hash Key: g100
+ Hash Key: g10
+ -> Parallel Seq Scan on gstest_p
+(14 rows)
+
+create table p_gs_hash_1_p as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+RESET enable_sort;
+RESET work_mem;
+RESET enable_groupingsets_hash_disk;
+RESET min_parallel_table_scan_size;
+-- Compare results
+(select * from p_gs_group_1 except select * from p_gs_group_1_p)
+ union all
+(select * from p_gs_group_1_p except select * from p_gs_group_1);
+ g100 | g10 | sum | count | max
+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from p_gs_group_1 except select * from p_gs_hash_1_p)
+ union all
+(select * from p_gs_hash_1_p except select * from p_gs_group_1);
+ g100 | g10 | sum | count | max
+------+-----+-----+-------+-----
+(0 rows)
+
+drop table gstest_p;
+drop table p_gs_group_1;
+drop table p_gs_group_1_p;
+drop table p_gs_hash_1_p;
-- end
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 478f49ecab..427b710b39 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -544,4 +544,68 @@ drop table gs_hash_3;
SET enable_groupingsets_hash_disk TO DEFAULT;
+--
+-- Compare results between parallel plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+create table gstest_p as select g%100 as g100, g%10 as g10, g
+from generate_series(0,199999) g;
+ANALYZE gstest_p;
+
+-- Prepared sort agg without parallelism
+set enable_hashagg = off;
+set min_parallel_table_scan_size = '128MB';
+
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+create table p_gs_group_1 as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+-- Prepare sort agg with parallelism
+set min_parallel_table_scan_size = '4kB';
+
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+create table p_gs_group_1_p as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+-- Prepare hash agg with parallelism
+SET enable_groupingsets_hash_disk = true;
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+create table p_gs_hash_1_p as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+RESET enable_sort;
+RESET work_mem;
+RESET enable_groupingsets_hash_disk;
+RESET min_parallel_table_scan_size;
+
+-- Compare results
+(select * from p_gs_group_1 except select * from p_gs_group_1_p)
+ union all
+(select * from p_gs_group_1_p except select * from p_gs_group_1);
+
+(select * from p_gs_group_1 except select * from p_gs_hash_1_p)
+ union all
+(select * from p_gs_hash_1_p except select * from p_gs_group_1);
+
+drop table gstest_p;
+drop table p_gs_group_1;
+drop table p_gs_group_1_p;
+drop table p_gs_hash_1_p;
-- end
--
2.14.1
On Fri, Mar 20, 2020 at 07:57:02PM +0800, Pengzhou Tang wrote:
Hi Tomas,
I rebased the code and resolved the comments you attached; some unresolved
comments are explained in 0002-fixes.patch, please take a look. I also made
the hash spill work for parallel grouping sets. The plan looks like:

gpadmin=# explain select g100, g10, sum(g::numeric), count(*), max(g::text)
from gstest_p group by cube (g100,g10);
QUERY PLAN
-------------------------------------------------------------------------------------------
Finalize MixedAggregate (cost=1000.00..7639.95 rows=1111 width=80)
Filtered by: (GROUPINGSETID())
Group Key: ()
Hash Key: g100, g10
Hash Key: g100
Hash Key: g10
Planned Partitions: 4
-> Gather (cost=1000.00..6554.34 rows=7777 width=84)
Workers Planned: 7
-> Partial MixedAggregate (cost=0.00..4776.64 rows=1111 width=84)
Group Key: ()
Hash Key: g100, g10
Hash Key: g100
Hash Key: g10
Planned Partitions: 4
-> Parallel Seq Scan on gstest_p (cost=0.00..1367.71
rows=28571 width=12)
(16 rows)
Hmmm, OK. I think there's some sort of memory leak, though. I've tried
running a simple grouping set query on catalog_sales table from TPC-DS
scale 100GB test. The query is pretty simple:
select count(*) from catalog_sales
group by cube (cs_warehouse_sk, cs_ship_mode_sk, cs_call_center_sk);
with a partial MixedAggregate plan (attached). When executed, it however
allocates more and more memory, and eventually gets killed by an OOM
killer. This is on a machine with 8GB of RAM, work_mem=4MB (and 4
parallel workers).
The memory context stats from a running process before it gets killed by
OOM look like this
TopMemoryContext: 101560 total in 6 blocks; 7336 free (6 chunks); 94224 used
TopTransactionContext: 73816 total in 4 blocks; 11624 free (0 chunks); 62192 used
ExecutorState: 1375731712 total in 174 blocks; 5391392 free (382 chunks); 1370340320 used
HashAgg meta context: 315784 total in 10 blocks; 15400 free (2 chunks); 300384 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
...
That's 1.3GB allocated in ExecutorState - that doesn't seem right.
FWIW there are only very few groups (each attribute has fewer than 30
distinct values), so there are only about ~1000 groups. On master it works
just fine, of course.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Thanks a lot. The patch had a memory leak in lookup_hash_entries: it used
list_concat there, causing a 64-byte leak for every tuple. That is fixed now.
Also, resolved conflicts and rebased the code.
Thanks,
Pengzhou
Attachments:
0003-fix-a-numtrans-bug.patch
From f9f013dc3e9eb15e0bc9929adf4bee16e0049180 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Thu, 12 Mar 2020 04:38:36 -0400
Subject: [PATCH 3/5] fix a numtrans bug
aggstate->numtrans is always zero when building the hash table for
hash aggregates, which makes the estimated additional size of the
hash table entries incorrect.
---
src/backend/executor/nodeAgg.c | 67 +++++++++++++++++++++++-------------------
1 file changed, 36 insertions(+), 31 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b02431c..b4d652f 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -3584,39 +3584,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
if (use_hashing)
{
- Plan *outerplan = outerPlan(node);
- uint64 totalGroups = 0;
- int i;
-
- aggstate->hash_metacxt = AllocSetContextCreate(
- aggstate->ss.ps.state->es_query_cxt,
- "HashAgg meta context",
- ALLOCSET_DEFAULT_SIZES);
- aggstate->hash_spill_slot = ExecInitExtraTupleSlot(
- estate, scanDesc, &TTSOpsMinimalTuple);
-
/* this is an array of pointers, not structures */
aggstate->hash_pergroup = pergroups;
-
- aggstate->hashentrysize = hash_agg_entry_size(
- aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
-
- /*
- * Consider all of the grouping sets together when setting the limits
- * and estimating the number of partitions. This can be inaccurate
- * when there is more than one grouping set, but should still be
- * reasonable.
- */
- for (i = 0; i < aggstate->num_hashes; i++)
- totalGroups += aggstate->perhash[i].aggnode->numGroups;
-
- hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
- &aggstate->hash_mem_limit,
- &aggstate->hash_ngroups_limit,
- &aggstate->hash_planned_partitions);
- find_hash_columns(aggstate);
- build_hash_tables(aggstate);
- aggstate->table_filled = false;
}
/*
@@ -3972,6 +3941,42 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
+ /* Initialize hash contexts and hash tables for hash aggregates */
+ if (use_hashing)
+ {
+ Plan *outerplan = outerPlan(node);
+ uint64 totalGroups = 0;
+ int i;
+
+ aggstate->hash_metacxt = AllocSetContextCreate(
+ aggstate->ss.ps.state->es_query_cxt,
+ "HashAgg meta context",
+ ALLOCSET_DEFAULT_SIZES);
+ aggstate->hash_spill_slot = ExecInitExtraTupleSlot(
+ estate, scanDesc, &TTSOpsMinimalTuple);
+
+ aggstate->hashentrysize = hash_agg_entry_size(
+ aggstate->numtrans, outerplan->plan_width, node->transitionSpace);
+
+ /*
+ * Consider all of the grouping sets together when setting the limits
+ * and estimating the number of partitions. This can be inaccurate
+ * when there is more than one grouping set, but should still be
+ * reasonable.
+ */
+ for (i = 0; i < aggstate->num_hashes; i++)
+ totalGroups += aggstate->perhash[i].aggnode->numGroups;
+
+ hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ &aggstate->hash_mem_limit,
+ &aggstate->hash_ngroups_limit,
+ &aggstate->hash_planned_partitions);
+
+ find_hash_columns(aggstate);
+ build_hash_tables(aggstate);
+ aggstate->table_filled = false;
+ }
+
/*
* Build expressions doing all the transition work at once. We build a
* different one for each phase, as the number of transition function
--
1.8.3.1
0001-All-grouping-sets-do-their-own-sorting.patch
From 99f331c340dad4c0ed32dca33d7781ee7e2a8109 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:07:29 -0400
Subject: [PATCH 1/5] All grouping sets do their own sorting
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
PG used to add a Sort path explicitly beneath the Agg for a sort aggregate;
a grouping sets path likewise added a Sort path for the first sort aggregate
phase, while the subsequent sort aggregate phases did their own sorting
using a tuplesort.

This commit unifies how grouping sets paths sort their input: all sort
aggregate phases now do their own sorting using a tuplesort.
This commit is mainly a preparatory step toward parallel grouping sets. The
main idea of parallel grouping sets is: as with parallel aggregate, we
separate grouping sets into two stages.

The initial stage: this stage has almost the same plan and execution
routines as the current implementation of grouping sets; the differences are
1) it only produces partial aggregate results, and 2) the output carries an
extra grouping set ID. Partial aggregate results will be combined in the
final stage, and since there are multiple grouping sets, only partial
results belonging to the same grouping set can be combined; that is why the
grouping set ID is introduced to identify the sets. We keep all the
multiple-grouping-sets optimizations in the initial stage, e.g. 1) grouping
sets that can be grouped by a single sort are put into one rollup structure,
so those sets are computed in one aggregate phase, 2) hash aggregation runs
concurrently while a sort aggregate is performed, and 3) all hash
transitions are done in one expression state.

The final stage: this stage combines the partial aggregate results according
to the grouping set ID. Obviously, none of the initial-stage optimizations
can be used here, so all rollups are extracted so that each rollup contains
only one grouping set, and each aggregate phase then processes one set. We
apply a filter in the final stage that redirects the tuples to each
aggregate phase.

Obviously, adding a Sort path underneath the Agg in the final stage would be
wrong. This commit avoids that: all non-hashed aggregate phases do their own
sorting after the tuples are redirected.
---
src/backend/commands/explain.c | 5 +-
src/backend/executor/nodeAgg.c | 79 +++++++++++---
src/backend/nodes/copyfuncs.c | 1 +
src/backend/nodes/outfuncs.c | 1 +
src/backend/nodes/readfuncs.c | 1 +
src/backend/optimizer/plan/createplan.c | 65 ++++++++----
src/backend/optimizer/plan/planner.c | 66 ++++++++----
src/backend/optimizer/util/pathnode.c | 30 +++++-
src/include/executor/nodeAgg.h | 2 -
src/include/nodes/execnodes.h | 5 +-
src/include/nodes/pathnodes.h | 1 +
src/include/nodes/plannodes.h | 1 +
src/include/optimizer/pathnode.h | 3 +-
src/include/optimizer/planmain.h | 2 +-
src/test/regress/expected/groupingsets.out | 161 ++++++++++++++---------------
15 files changed, 275 insertions(+), 148 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ff2f45c..6914d18 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2291,15 +2291,14 @@ show_grouping_sets(PlanState *planstate, Agg *agg,
ExplainOpenGroup("Grouping Sets", "Grouping Sets", false, es);
- show_grouping_set_keys(planstate, agg, NULL,
+ show_grouping_set_keys(planstate, agg, (Sort *) agg->sortnode,
context, useprefix, ancestors, es);
foreach(lc, agg->chain)
{
Agg *aggnode = lfirst(lc);
- Sort *sortnode = (Sort *) aggnode->plan.lefttree;
- show_grouping_set_keys(planstate, aggnode, sortnode,
+ show_grouping_set_keys(planstate, aggnode, (Sort *) aggnode->sortnode,
context, useprefix, ancestors, es);
}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2a6f44a..0a63980 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -404,6 +404,7 @@ static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash,
bool *in_hash_table);
static void lookup_hash_entries(AggState *aggstate);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
+static void agg_sort_input(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static bool agg_refill_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
@@ -516,7 +517,7 @@ initialize_phase(AggState *aggstate, int newphase)
*/
if (newphase > 0 && newphase < aggstate->numphases - 1)
{
- Sort *sortnode = aggstate->phases[newphase + 1].sortnode;
+ Sort *sortnode = (Sort *)aggstate->phases[newphase + 1].aggnode->sortnode;
PlanState *outerNode = outerPlanState(aggstate);
TupleDesc tupDesc = ExecGetResultType(outerNode);
@@ -2116,6 +2117,8 @@ ExecAgg(PlanState *pstate)
break;
case AGG_PLAIN:
case AGG_SORTED:
+ if (!node->input_sorted)
+ agg_sort_input(node);
result = agg_retrieve_direct(node);
break;
}
@@ -2473,6 +2476,45 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+static void
+agg_sort_input(AggState *aggstate)
+{
+ AggStatePerPhase phase = &aggstate->phases[1];
+ TupleDesc tupDesc;
+ Sort *sortnode;
+
+ Assert(!aggstate->input_sorted);
+ Assert(phase->aggnode->sortnode);
+
+ sortnode = (Sort *) phase->aggnode->sortnode;
+ tupDesc = ExecGetResultType(outerPlanState(aggstate));
+
+ aggstate->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ for (;;)
+ {
+ TupleTableSlot *outerslot;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
+ if (TupIsNull(outerslot))
+ break;
+
+ tuplesort_puttupleslot(aggstate->sort_in, outerslot);
+ }
+
+ /* Sort the first phase */
+ tuplesort_performsort(aggstate->sort_in);
+
+ /* Mark the input to be sorted */
+ aggstate->input_sorted = true;
+}
+
/*
* ExecAgg for hashed case: read input and build hash table
*/
@@ -3143,6 +3185,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
Plan *outerPlan;
ExprContext *econtext;
TupleDesc scanDesc;
+ Agg *firstSortAgg;
int numaggs,
transno,
aggno;
@@ -3187,6 +3230,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->grp_firstTuple = NULL;
aggstate->sort_in = NULL;
aggstate->sort_out = NULL;
+ aggstate->input_sorted = true;
/*
* phases[0] always exists, but is dummy in sorted/plain mode
@@ -3194,6 +3238,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numPhases = (use_hashing ? 1 : 2);
numHashes = (use_hashing ? 1 : 0);
+ firstSortAgg = node->aggstrategy == AGG_SORTED ? node : NULL;
+
/*
* Calculate the maximum number of grouping sets in any phase; this
* determines the size of some allocations. Also calculate the number of
@@ -3215,7 +3261,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* others add an extra phase.
*/
if (agg->aggstrategy != AGG_HASHED)
+ {
++numPhases;
+
+ if (!firstSortAgg)
+ firstSortAgg = agg;
+
+ }
else
++numHashes;
}
@@ -3224,6 +3276,13 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->maxsets = numGroupingSets;
aggstate->numphases = numPhases;
+ /*
+	 * The input to the first SORTED phase is not sorted, so the agg needs
+	 * to do its own sort; see agg_sort_input(). This can only happen in
+	 * the grouping sets case.
+ */
+ if (firstSortAgg && firstSortAgg->sortnode)
+ aggstate->input_sorted = false;
+
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
@@ -3285,7 +3344,8 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* If there are more than two phases (including a potential dummy phase
* 0), input will be resorted using tuplesort. Need a slot for that.
*/
- if (numPhases > 2)
+ if (numPhases > 2 ||
+ !aggstate->input_sorted)
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -3356,20 +3416,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
for (phaseidx = 0; phaseidx <= list_length(node->chain); ++phaseidx)
{
Agg *aggnode;
- Sort *sortnode;
if (phaseidx > 0)
- {
aggnode = list_nth_node(Agg, node->chain, phaseidx - 1);
- sortnode = castNode(Sort, aggnode->plan.lefttree);
- }
else
- {
aggnode = node;
- sortnode = NULL;
- }
-
- Assert(phase <= 1 || sortnode);
if (aggnode->aggstrategy == AGG_HASHED
|| aggnode->aggstrategy == AGG_MIXED)
@@ -3486,7 +3537,6 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->aggnode = aggnode;
phasedata->aggstrategy = aggnode->aggstrategy;
- phasedata->sortnode = sortnode;
}
}
@@ -4621,6 +4671,10 @@ ExecReScanAgg(AggState *node)
sizeof(AggStatePerGroupData) * node->numaggs);
}
+ /* Reset input_sorted */
+ if (aggnode->sortnode)
+ node->input_sorted = false;
+
/* reset to phase 1 */
initialize_phase(node, 1);
@@ -4628,6 +4682,7 @@ ExecReScanAgg(AggState *node)
node->projected_set = -1;
}
+
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
}
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index eaab97f..04b4c65 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -992,6 +992,7 @@ _copyAgg(const Agg *from)
COPY_BITMAPSET_FIELD(aggParams);
COPY_NODE_FIELD(groupingSets);
COPY_NODE_FIELD(chain);
+ COPY_NODE_FIELD(sortnode);
return newnode;
}
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e084c3f..5816d12 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -787,6 +787,7 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_BITMAPSET_FIELD(aggParams);
WRITE_NODE_FIELD(groupingSets);
WRITE_NODE_FIELD(chain);
+ WRITE_NODE_FIELD(sortnode);
}
static void
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index d5b23a3..af4fcfe 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -2207,6 +2207,7 @@ _readAgg(void)
READ_BITMAPSET_FIELD(aggParams);
READ_NODE_FIELD(groupingSets);
READ_NODE_FIELD(chain);
+ READ_NODE_FIELD(sortnode);
READ_DONE();
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index fc25908..d5b3408 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1645,6 +1645,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
NIL,
best_path->path.rows,
0,
+ NULL,
subplan);
}
else
@@ -2098,6 +2099,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
NIL,
best_path->numGroups,
best_path->transitionSpace,
+ NULL,
subplan);
copy_generic_path_info(&plan->plan, (Path *) best_path);
@@ -2159,6 +2161,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
List *rollups = best_path->rollups;
AttrNumber *grouping_map;
int maxref;
+ int flags = CP_LABEL_TLIST;
List *chain;
ListCell *lc;
@@ -2168,9 +2171,15 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
/*
* Agg can project, so no need to be terribly picky about child tlist, but
- * we do need grouping columns to be available
+ * we do need grouping columns to be available; If the groupingsets need
+ * to sort the input, the agg will store the input rows in a tuplesort,
+ * it therefore behooves us to request a small tlist to avoid wasting
+ * spaces.
*/
- subplan = create_plan_recurse(root, best_path->subpath, CP_LABEL_TLIST);
+ if (!best_path->is_sorted)
+ flags = flags | CP_SMALL_TLIST;
+
+ subplan = create_plan_recurse(root, best_path->subpath, flags);
/*
* Compute the mapping from tleSortGroupRef to column index in the child's
@@ -2230,12 +2239,22 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
- if (!rollup->is_hashed && !is_first_sort)
+ if (!rollup->is_hashed)
{
- sort_plan = (Plan *)
- make_sort_from_groupcols(rollup->groupClause,
- new_grpColIdx,
- subplan);
+ if (!is_first_sort ||
+ (is_first_sort && !best_path->is_sorted))
+ {
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ new_grpColIdx,
+ subplan);
+
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
}
if (!rollup->is_hashed)
@@ -2260,16 +2279,8 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
NIL,
rollup->numGroups,
best_path->transitionSpace,
- sort_plan);
-
- /*
- * Remove stuff we don't need to avoid bloating debug output.
- */
- if (sort_plan)
- {
- sort_plan->targetlist = NIL;
- sort_plan->lefttree = NULL;
- }
+ sort_plan,
+ NULL);
chain = lappend(chain, agg_plan);
}
@@ -2281,10 +2292,26 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
{
RollupData *rollup = linitial(rollups);
AttrNumber *top_grpColIdx;
+ Plan *sort_plan = NULL;
int numGroupCols;
top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
+ /* the input is not sorted yet */
+ if (!rollup->is_hashed &&
+ !best_path->is_sorted)
+ {
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ top_grpColIdx,
+ subplan);
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
+
numGroupCols = list_length((List *) linitial(rollup->gsets));
plan = make_agg(build_path_tlist(root, &best_path->path),
@@ -2299,6 +2326,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
chain,
rollup->numGroups,
best_path->transitionSpace,
+ sort_plan,
subplan);
/* Copy cost data from Path to Plan */
@@ -6197,7 +6225,7 @@ make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain, double dNumGroups,
- Size transitionSpace, Plan *lefttree)
+ Size transitionSpace, Plan *sortnode, Plan *lefttree)
{
Agg *node = makeNode(Agg);
Plan *plan = &node->plan;
@@ -6217,6 +6245,7 @@ make_agg(List *tlist, List *qual,
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
node->groupingSets = groupingSets;
node->chain = chain;
+ node->sortnode = sortnode;
plan->qual = qual;
plan->targetlist = tlist;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index b65abf6..6110f38 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -175,7 +175,8 @@ static void consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups);
+ double dNumGroups,
+ AggStrategy strat);
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -4183,6 +4184,14 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
* it, by combinations of hashing and sorting. This can be called multiple
* times, so it's important that it not scribble on input. No result is
* returned, but any generated paths are added to grouped_rel.
+ *
+ * - strat:
+ * preferred aggregate strategy to use.
+ *
+ * - is_sorted:
+ * Is the input sorted on the groupCols of the first rollup. Caller
+ * must set it correctly if strat is set to AGG_SORTED, the planner
+ * uses it to generate a sortnode.
*/
static void
consider_groupingsets_paths(PlannerInfo *root,
@@ -4192,13 +4201,15 @@ consider_groupingsets_paths(PlannerInfo *root,
bool can_hash,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
- double dNumGroups)
+ double dNumGroups,
+ AggStrategy strat)
{
Query *parse = root->parse;
+ Assert(strat == AGG_HASHED || strat == AGG_SORTED);
/*
- * If we're not being offered sorted input, then only consider plans that
- * can be done entirely by hashing.
+ * If strat is AGG_HASHED, then only consider plans that can be done
+ * entirely by hashing.
*
* We can hash everything if it looks like it'll fit in work_mem. But if
* the input is actually sorted despite not being advertised as such, we
@@ -4207,7 +4218,7 @@ consider_groupingsets_paths(PlannerInfo *root,
* If none of the grouping sets are sortable, then ignore the work_mem
* limit and generate a path anyway, since otherwise we'll just fail.
*/
- if (!is_sorted)
+ if (strat == AGG_HASHED)
{
List *new_rollups = NIL;
RollupData *unhashed_rollup = NULL;
@@ -4248,6 +4259,8 @@ consider_groupingsets_paths(PlannerInfo *root,
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
l_start = lnext(gd->rollups, l_start);
+ /* update is_sorted to true */
+ is_sorted = true;
}
hashsize = estimate_hashagg_tablesize(path,
@@ -4346,6 +4359,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->hashable = false;
rollup->is_hashed = false;
new_rollups = lappend(new_rollups, rollup);
+ /* update is_sorted to true */
+ is_sorted = true;
strat = AGG_MIXED;
}
@@ -4357,18 +4372,23 @@ consider_groupingsets_paths(PlannerInfo *root,
strat,
new_rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
return;
}
/*
- * If we have sorted input but nothing we can do with it, bail.
+ * Strategy is AGG_SORTED but nothing we can do with it, bail.
*/
if (list_length(gd->rollups) == 0)
return;
/*
- * Given sorted input, we try and make two paths: one sorted and one mixed
+	 * When callers request the AGG_SORTED strategy, the first rollup must
+	 * use a non-hashed aggregate; 'is_sorted' tells whether the first
+	 * rollup needs to do its own sort.
+	 *
+	 * We try to make two paths: one sorted and one mixed
* sort/hash. (We need to try both because hashagg might be disabled, or
* some columns might not be sortable.)
*
@@ -4425,7 +4445,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* We leave the first rollup out of consideration since it's the
- * one that matches the input sort order. We assign indexes "i"
+		 * one that needs to be sorted. We assign indexes "i"
* to only those entries considered for hashing; the second loop,
* below, must use the same condition.
*/
@@ -4514,7 +4534,8 @@ consider_groupingsets_paths(PlannerInfo *root,
AGG_MIXED,
rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
}
}
@@ -4530,7 +4551,8 @@ consider_groupingsets_paths(PlannerInfo *root,
AGG_SORTED,
gd->rollups,
agg_costs,
- dNumGroups));
+ dNumGroups,
+ is_sorted));
}
/*
@@ -6397,6 +6419,16 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
path->pathkeys);
if (path == cheapest_path || is_sorted)
{
+ if (parse->groupingSets)
+ {
+ /* consider AGG_SORTED strategy */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_costs, dNumGroups,
+ AGG_SORTED);
+ continue;
+ }
+
/* Sort the cheapest-total path if it isn't already sorted */
if (!is_sorted)
path = (Path *) create_sort_path(root,
@@ -6405,14 +6437,7 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
root->group_pathkeys,
-1.0);
- /* Now decide what to stick atop it */
- if (parse->groupingSets)
- {
- consider_groupingsets_paths(root, grouped_rel,
- path, true, can_hash,
- gd, agg_costs, dNumGroups);
- }
- else if (parse->hasAggs)
+ if (parse->hasAggs)
{
/*
* We have aggregation, possibly with plain GROUP BY. Make
@@ -6512,7 +6537,8 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
*/
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
- gd, agg_costs, dNumGroups);
+ gd, agg_costs, dNumGroups,
+ AGG_HASHED);
}
else
{
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 8ba8122..6e88992 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2984,6 +2984,7 @@ create_agg_path(PlannerInfo *root,
* 'rollups' is a list of RollupData nodes
* 'agg_costs' contains cost info about the aggregate functions to be computed
* 'numGroups' is the estimated total number of groups
+ * 'is_sorted' is whether the input is sorted on the group cols of the first rollup
*/
GroupingSetsPath *
create_groupingsets_path(PlannerInfo *root,
@@ -2993,7 +2994,8 @@ create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups)
+ double numGroups,
+ bool is_sorted)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
PathTarget *target = rel->reltarget;
@@ -3011,6 +3013,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->is_sorted = is_sorted;
/*
* Simplify callers by downgrading AGG_SORTED to AGG_PLAIN, and AGG_MIXED
@@ -3062,14 +3065,33 @@ create_groupingsets_path(PlannerInfo *root,
*/
if (is_first)
{
+ Cost input_startup_cost = subpath->startup_cost;
+ Cost input_total_cost = subpath->total_cost;
+
+ if (!rollup->is_hashed && !is_sorted && numGroupCols)
+ {
+ Path sort_path; /* dummy for result of cost_sort */
+
+ cost_sort(&sort_path, root, NIL,
+ input_total_cost,
+ subpath->rows,
+ subpath->pathtarget->width,
+ 0.0,
+ work_mem,
+ -1.0);
+
+ input_startup_cost = sort_path.startup_cost;
+ input_total_cost = sort_path.total_cost;
+ }
+
cost_agg(&pathnode->path, root,
aggstrategy,
agg_costs,
numGroupCols,
rollup->numGroups,
having_qual,
- subpath->startup_cost,
- subpath->total_cost,
+ input_startup_cost,
+ input_total_cost,
subpath->rows,
subpath->pathtarget->width);
is_first = false;
@@ -3081,7 +3103,7 @@ create_groupingsets_path(PlannerInfo *root,
Path sort_path; /* dummy for result of cost_sort */
Path agg_path; /* dummy for result of cost_agg */
- if (rollup->is_hashed || is_first_sort)
+ if (rollup->is_hashed || (is_first_sort && is_sorted))
{
/*
* Account for cost of aggregation, but don't charge input
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index a5b8a00..9e70bd8 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -277,8 +277,6 @@ typedef struct AggStatePerPhaseData
ExprState **eqfunctions; /* expression returning equality, indexed by
* nr of cols to compare */
Agg *aggnode; /* Agg node for phase data */
- Sort *sortnode; /* Sort node for input ordering for phase */
-
ExprState *evaltrans; /* evaluation of transition functions */
/* cached variants of the compiled expression */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3d27d50..75a45b2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2103,8 +2103,11 @@ typedef struct AggState
AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
* per-group pointers */
+ /* these fields are used in AGG_SORTED and AGG_MIXED */
+	bool	input_sorted;	/* is the input already sorted? */
+
/* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 49
+#define FIELDNO_AGGSTATE_ALL_PERGROUPS 50
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 0ceb809..c1e69c8 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1702,6 +1702,7 @@ typedef struct GroupingSetsPath
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
uint64 transitionSpace; /* for pass-by-ref transition data */
+ bool is_sorted; /* input sorted in groupcols of first rollup */
} GroupingSetsPath;
/*
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 4869fe7..3cd2537 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -818,6 +818,7 @@ typedef struct Agg
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
List *groupingSets; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Plan *sortnode; /* agg does its own sort, only used by grouping sets now */
} Agg;
/* ----------------
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index e450fe1..f9f388b 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -217,7 +217,8 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
AggStrategy aggstrategy,
List *rollups,
const AggClauseCosts *agg_costs,
- double numGroups);
+ double numGroups,
+ bool is_sorted);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
PathTarget *target,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 4781201..5954ff3 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -55,7 +55,7 @@ extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
List *groupingSets, List *chain, double dNumGroups,
- Size transitionSpace, Plan *lefttree);
+ Size transitionSpace, Plan *sortnode, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
/*
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 05ff204..1cb9700 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -366,15 +366,14 @@ explain (costs off)
select g as alias1, g as alias2
from generate_series(1,3) g
group by alias1, rollup(alias2);
- QUERY PLAN
-------------------------------------------------
+ QUERY PLAN
+------------------------------------------
GroupAggregate
- Group Key: g, g
- Group Key: g
- -> Sort
- Sort Key: g
- -> Function Scan on generate_series g
-(6 rows)
+ Sort Key: g, g
+ Group Key: g, g
+ Group Key: g
+ -> Function Scan on generate_series g
+(5 rows)
select g as alias1, g as alias2
from generate_series(1,3) g
@@ -640,15 +639,14 @@ select a, b, sum(v.x)
-- Test reordering of grouping sets
explain (costs off)
select * from gstest1 group by grouping sets((a,b,v),(v)) order by v,b,a;
- QUERY PLAN
-------------------------------------------------------------------------------
+ QUERY PLAN
+---------------------------------------------------------------------------
GroupAggregate
- Group Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
- Group Key: "*VALUES*".column3
- -> Sort
- Sort Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
- -> Values Scan on "*VALUES*"
-(6 rows)
+ Sort Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
+ Group Key: "*VALUES*".column3, "*VALUES*".column2, "*VALUES*".column1
+ Group Key: "*VALUES*".column3
+ -> Values Scan on "*VALUES*"
+(5 rows)
-- Agg level check. This query should error out.
select (select grouping(a,b) from gstest2) from gstest2 group by a,b;
@@ -723,13 +721,12 @@ explain (costs off)
QUERY PLAN
----------------------------------
GroupAggregate
- Group Key: a
- Group Key: ()
+ Sort Key: a
+ Group Key: a
+ Group Key: ()
Filter: (a IS DISTINCT FROM 1)
- -> Sort
- Sort Key: a
- -> Seq Scan on gstest2
-(7 rows)
+ -> Seq Scan on gstest2
+(6 rows)
select v.c, (select count(*) from gstest2 group by () having v.c)
from (values (false),(true)) v(c) order by v.c;
@@ -1018,18 +1015,17 @@ explain (costs off) select a, b, grouping(a,b), sum(v), count(*), max(v)
explain (costs off)
select a, b, grouping(a,b), array_agg(v order by v)
from gstest1 group by cube(a,b);
- QUERY PLAN
-----------------------------------------------------------
+ QUERY PLAN
+-------------------------------------------------------
GroupAggregate
- Group Key: "*VALUES*".column1, "*VALUES*".column2
- Group Key: "*VALUES*".column1
- Group Key: ()
+ Sort Key: "*VALUES*".column1, "*VALUES*".column2
+ Group Key: "*VALUES*".column1, "*VALUES*".column2
+ Group Key: "*VALUES*".column1
+ Group Key: ()
Sort Key: "*VALUES*".column2
Group Key: "*VALUES*".column2
- -> Sort
- Sort Key: "*VALUES*".column1, "*VALUES*".column2
- -> Values Scan on "*VALUES*"
-(9 rows)
+ -> Values Scan on "*VALUES*"
+(8 rows)
-- unsortable cases
select unsortable_col, count(*)
@@ -1071,11 +1067,10 @@ explain (costs off)
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
Hash Key: unsortable_col
- Group Key: unhashable_col
- -> Sort
- Sort Key: unhashable_col
- -> Seq Scan on gstest4
-(8 rows)
+ Sort Key: unhashable_col
+ Group Key: unhashable_col
+ -> Seq Scan on gstest4
+(7 rows)
select unhashable_col, unsortable_col,
grouping(unhashable_col, unsortable_col),
@@ -1114,11 +1109,10 @@ explain (costs off)
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
Hash Key: v, unsortable_col
- Group Key: v, unhashable_col
- -> Sort
- Sort Key: v, unhashable_col
- -> Seq Scan on gstest4
-(8 rows)
+ Sort Key: v, unhashable_col
+ Group Key: v, unhashable_col
+ -> Seq Scan on gstest4
+(7 rows)
-- empty input: first is 0 rows, second 1, third 3 etc.
select a, b, sum(v), count(*) from gstest_empty group by grouping sets ((a,b),a);
@@ -1366,19 +1360,18 @@ explain (costs off)
BEGIN;
SET LOCAL enable_hashagg = false;
EXPLAIN (COSTS OFF) SELECT a, b, count(*), max(a), max(b) FROM gstest3 GROUP BY GROUPING SETS(a, b,()) ORDER BY a, b;
- QUERY PLAN
----------------------------------------
+ QUERY PLAN
+---------------------------------
Sort
Sort Key: a, b
-> GroupAggregate
- Group Key: a
- Group Key: ()
+ Sort Key: a
+ Group Key: a
+ Group Key: ()
Sort Key: b
Group Key: b
- -> Sort
- Sort Key: a
- -> Seq Scan on gstest3
-(10 rows)
+ -> Seq Scan on gstest3
+(9 rows)
SELECT a, b, count(*), max(a), max(b) FROM gstest3 GROUP BY GROUPING SETS(a, b,()) ORDER BY a, b;
a | b | count | max | max
@@ -1549,22 +1542,21 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+----------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
- Group Key: unique1
+ Sort Key: unique1
+ Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
Sort Key: thousand
Group Key: thousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(13 rows)
+ -> Seq Scan on tenk1
+(12 rows)
explain (costs off)
select unique1,
@@ -1572,18 +1564,17 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+-------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
- Group Key: unique1
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(9 rows)
+ Sort Key: unique1
+ Group Key: unique1
+ -> Seq Scan on tenk1
+(8 rows)
set work_mem = '384kB';
explain (costs off)
@@ -1592,21 +1583,20 @@ explain (costs off)
count(hundred), count(thousand), count(twothousand),
count(*)
from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);
- QUERY PLAN
--------------------------------
+ QUERY PLAN
+----------------------------
MixedAggregate
Hash Key: two
Hash Key: four
Hash Key: ten
Hash Key: hundred
Hash Key: thousand
- Group Key: unique1
+ Sort Key: unique1
+ Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
- -> Sort
- Sort Key: unique1
- -> Seq Scan on tenk1
-(12 rows)
+ -> Seq Scan on tenk1
+(11 rows)
-- check collation-sensitive matching between grouping expressions
-- (similar to a check for aggregates, but there are additional code
@@ -1648,23 +1638,22 @@ select g100, g10, sum(g::numeric), count(*), max(g::text) from
(select g%1000 as g1000, g%100 as g100, g%10 as g10, g
from generate_series(0,1999) g) s
group by cube (g1000, g100,g10);
- QUERY PLAN
----------------------------------------------------------------
+ QUERY PLAN
+------------------------------------------------------
GroupAggregate
- Group Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
- Group Key: ((g.g % 1000)), ((g.g % 100))
- Group Key: ((g.g % 1000))
- Group Key: ()
- Sort Key: ((g.g % 100)), ((g.g % 10))
- Group Key: ((g.g % 100)), ((g.g % 10))
- Group Key: ((g.g % 100))
- Sort Key: ((g.g % 10)), ((g.g % 1000))
- Group Key: ((g.g % 10)), ((g.g % 1000))
- Group Key: ((g.g % 10))
- -> Sort
- Sort Key: ((g.g % 1000)), ((g.g % 100)), ((g.g % 10))
- -> Function Scan on generate_series g
-(14 rows)
+ Sort Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Group Key: (g.g % 1000), (g.g % 100), (g.g % 10)
+ Group Key: (g.g % 1000), (g.g % 100)
+ Group Key: (g.g % 1000)
+ Group Key: ()
+ Sort Key: (g.g % 100), (g.g % 10)
+ Group Key: (g.g % 100), (g.g % 10)
+ Group Key: (g.g % 100)
+ Sort Key: (g.g % 10), (g.g % 1000)
+ Group Key: (g.g % 10), (g.g % 1000)
+ Group Key: (g.g % 10)
+ -> Function Scan on generate_series g
+(13 rows)
create table gs_group_1 as
select g100, g10, sum(g::numeric), count(*), max(g::text) from
--
1.8.3.1
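The agg_sort_input() logic added above can be illustrated with a small
hypothetical sketch (not PostgreSQL code): when the first sorted phase
receives unsorted input, the agg buffers and sorts the rows itself
before grouped aggregation. Here qsort() stands in for the tuplesort
machinery, and summation stands in for the transition function:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of the agg_sort_input() idea: buffer unsorted
 * input, sort it, then aggregate group by group in one pass. */
typedef struct
{
    int  key;  /* grouping column */
    long val;  /* value to aggregate */
} Row;

static int
cmp_key(const void *a, const void *b)
{
    return ((const Row *) a)->key - ((const Row *) b)->key;
}

/* Sort the input in place, then emit one summed group per distinct key;
 * returns the number of groups written to out[]. */
static int
sorted_group_sum(Row *in, int n, Row *out)
{
    int ngroups = 0;

    qsort(in, n, sizeof(Row), cmp_key); /* the agg's own sort step */
    for (int i = 0; i < n; i++)
    {
        if (ngroups == 0 || out[ngroups - 1].key != in[i].key)
        {
            out[ngroups].key = in[i].key;   /* group boundary: new group */
            out[ngroups].val = 0;
            ngroups++;
        }
        out[ngroups - 1].val += in[i].val;  /* transition function: sum */
    }
    return ngroups;
}
```

This mirrors why the planner no longer has to put an explicit Sort node
beneath the Agg for the first sorted rollup: the sort can happen inside
the aggregate node itself.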
Attachment: 0002-fixes.patch (application/octet-stream)
From 122020711d05e1e261a36343c2966d0d87360739 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tv@fuzzy.cz>
Date: Thu, 19 Mar 2020 01:20:58 +0100
Subject: [PATCH 2/5] fixes
---
src/backend/executor/nodeAgg.c | 3 +--
src/backend/optimizer/plan/createplan.c | 15 ++++++++---
src/backend/optimizer/plan/planner.c | 47 ++++++++++++++++++++++++++-------
3 files changed, 49 insertions(+), 16 deletions(-)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 0a63980..b02431c 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -517,7 +517,7 @@ initialize_phase(AggState *aggstate, int newphase)
*/
if (newphase > 0 && newphase < aggstate->numphases - 1)
{
- Sort *sortnode = (Sort *)aggstate->phases[newphase + 1].aggnode->sortnode;
+ Sort *sortnode = (Sort *) aggstate->phases[newphase + 1].aggnode->sortnode;
PlanState *outerNode = outerPlanState(aggstate);
TupleDesc tupDesc = ExecGetResultType(outerNode);
@@ -4682,7 +4682,6 @@ ExecReScanAgg(AggState *node)
node->projected_set = -1;
}
-
if (outerPlan->chgParam == NULL)
ExecReScan(outerPlan);
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index d5b3408..7c29f89 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2171,10 +2171,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
/*
* Agg can project, so no need to be terribly picky about child tlist, but
- * we do need grouping columns to be available; If the groupingsets need
+ * we do need grouping columns to be available. If the groupingsets need
* to sort the input, the agg will store the input rows in a tuplesort,
- * it therefore behooves us to request a small tlist to avoid wasting
- * spaces.
+ * so we request a small tlist to avoid wasting space.
*/
if (!best_path->is_sorted)
flags = flags | CP_SMALL_TLIST;
@@ -2239,6 +2238,11 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
new_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
+ /*
+ * For the first rollup using sorted mode, add an explicit sort
+ * node only if the input is not sorted yet; for other rollups
+ * using sorted mode, always add an explicit sort.
+ */
if (!rollup->is_hashed)
{
if (!is_first_sort ||
@@ -2297,7 +2301,10 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
top_grpColIdx = remap_groupColIdx(root, rollup->groupClause);
- /* the input is not sorted yet */
+ /*
+ * When the rollup uses sorted mode, and the input is not already sorted,
+ * add an explicit sort.
+ */
if (!rollup->is_hashed &&
!best_path->is_sorted)
{
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6110f38..0cab951 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4185,13 +4185,22 @@ create_ordinary_grouping_paths(PlannerInfo *root, RelOptInfo *input_rel,
* times, so it's important that it not scribble on input. No result is
* returned, but any generated paths are added to grouped_rel.
*
- * - strat:
- * preferred aggregate strategy to use.
- *
- * - is_sorted:
- * Is the input sorted on the groupCols of the first rollup. Caller
- * must set it correctly if strat is set to AGG_SORTED, the planner
- * uses it to generate a sortnode.
+ * The caller specifies the preferred aggregate strategy (sorted or hashed)
+ * using the strat parameter. When the requested strategy is AGG_SORTED, the
+ * input path needs to be sorted accordingly (is_sorted needs to be true).
+ *
+ * Pengzhou: is_sorted is actually a hint here; callers that prefer
+ * AGG_SORTED are no longer forced to add an explicit sort path before
+ * calling this function. Please see the comments in the callers.
+ *
+ * Ideally, consider_groupingsets_paths() should check whether the input is
+ * sorted or not. However, callers that prefer AGG_SORTED are forced to
+ * check is_sorted already (to see whether a non-cheapest path is worth
+ * considering), so consider_groupingsets_paths() need not check it again.
+ * Callers that prefer AGG_HASHED never check is_sorted; they only consider
+ * the cheapest path, but the cheapest path may coincidentally be sorted
+ * already. That is why AGG_MIXED is chosen when strat is specified
+ * as AGG_HASHED.
*/
static void
consider_groupingsets_paths(PlannerInfo *root,
@@ -4259,7 +4268,7 @@ consider_groupingsets_paths(PlannerInfo *root,
unhashed_rollup = lfirst_node(RollupData, l_start);
exclude_groups = unhashed_rollup->numGroups;
l_start = lnext(gd->rollups, l_start);
- /* update is_sorted to true */
+ /* the input happens to be usefully sorted; update is_sorted */
is_sorted = true;
}
@@ -4359,7 +4368,10 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->hashable = false;
rollup->is_hashed = false;
new_rollups = lappend(new_rollups, rollup);
- /* update is_sorted to true */
+ /*
+ * The first non-hashed rollup is a plain aggregate, so
+ * is_sorted should be true.
+ */
is_sorted = true;
strat = AGG_MIXED;
}
@@ -4394,6 +4406,9 @@ consider_groupingsets_paths(PlannerInfo *root,
*
* can_hash is passed in as false if some obstacle elsewhere (such as
* ordered aggs) means that we shouldn't consider hashing at all.
+ *
+ * XXX This comment seems to be broken by the patch, and it's not very
+ * clear to me what it tries to say.
*/
if (can_hash && gd->any_hashable)
{
@@ -4445,7 +4460,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* We leave the first rollup out of consideration since it's the
- * one that need to be sorted. We assign indexes "i"
+ * one that matches the input sort order. We assign indexes "i"
* to only those entries considered for hashing; the second loop,
* below, must use the same condition.
*/
@@ -6419,6 +6434,18 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
path->pathkeys);
if (path == cheapest_path || is_sorted)
{
+ /* XXX Why do we do it before possibly adding an explicit sort on top? */
+ /*
+ * Pengzhou: this patch intends to let each sorted aggregate phase
+ * do its own sorting, including the first phase, so in the final
+ * stage of parallel grouping sets the tuples are put into the
+ * temp storage of each sorted phase and then each sorted phase
+ * does its own sorting, one by one.
+ * Adding an explicit sort path underneath the main Agg node would
+ * make tuples from all grouping sets sorted using the sort key of
+ * the first phase, which is not right.
+ *
+ */
if (parse->groupingSets)
{
/* consider AGG_SORTED strategy */
--
1.8.3.1
Attachment: v1-0004-Reorganise-the-aggregate-phases.patch (application/octet-stream)
From f869146af129fa348d78eefe6bad737cbc550cb3 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:13:44 -0400
Subject: [PATCH 4/5] Reorganise the aggregate phases
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
This commit is a preparatory step toward supporting parallel grouping sets.
When planning, PG used to organize the grouping sets in [HASHED] -> [SORTED]
order, meaning HASHED aggregates were always located before SORTED aggregates.
When initializing the AGG node, PG likewise organized the aggregate phases in
[HASHED] -> [SORTED] order, with all HASHED grouping sets squeezed into phase 0.
When executing the AGG node with the AGG_SORTED or AGG_MIXED strategy, the
executor started from phase 1 -> phase 2 -> phase 3, and then phase 0 if the
strategy was AGG_MIXED. This is troublesome when adding support for parallel
grouping sets: first, we need complicated logic in many places to locate the
first sorted rollup/phase and to handle the special ordering for each strategy;
second, squeezing all hashed grouping sets into phase 0 does not work for
parallel grouping sets, because we cannot put all hash transition functions
into one expression state in the final stage.
This commit organizes the grouping sets in a more natural order,
[SORTED] -> [HASHED], and the HASHED sets are no longer squeezed into a single
phase. Instead, we use another way to put all hash transitions into the first
phase's expression state, and the executor now starts execution from phase 0
for all strategies.
This commit also moves 'sort_in' from AggState to the AggStatePerPhase*
structures. This helps handle more complicated cases once parallel grouping
sets are introduced; we might then need to add a tuplestore 'store_in' to
store partial aggregate results for PLAIN sets.
This commit also makes the hash spill/refill logic clearer and avoids using
a nullcheck when refilling the hash table.
---
contrib/postgres_fdw/expected/postgres_fdw.out | 4 +-
src/backend/commands/explain.c | 2 +-
src/backend/executor/execExpr.c | 57 +-
src/backend/executor/execExprInterp.c | 30 +-
src/backend/executor/nodeAgg.c | 974 ++++++++++++----------
src/backend/jit/llvm/llvmjit_expr.c | 51 +-
src/backend/optimizer/plan/createplan.c | 29 +-
src/backend/optimizer/plan/planner.c | 9 +-
src/backend/optimizer/util/pathnode.c | 65 +-
src/include/executor/execExpr.h | 5 +-
src/include/executor/executor.h | 2 +-
src/include/executor/nodeAgg.h | 34 +-
src/include/nodes/execnodes.h | 22 +-
src/test/regress/expected/groupingsets.out | 40 +-
src/test/regress/expected/partition_aggregate.out | 2 +-
15 files changed, 681 insertions(+), 645 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 62c2697..fc0ed2f 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -3448,8 +3448,8 @@ select c2, sum(c1) from ft1 where c2 < 3 group by rollup(c2) order by 1 nulls la
Sort Key: ft1.c2
-> MixedAggregate
Output: c2, sum(c1)
- Hash Key: ft1.c2
Group Key: ()
+ Hash Key: ft1.c2
-> Foreign Scan on public.ft1
Output: c2, c1
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
@@ -3473,8 +3473,8 @@ select c2, sum(c1) from ft1 where c2 < 3 group by cube(c2) order by 1 nulls last
Sort Key: ft1.c2
-> MixedAggregate
Output: c2, sum(c1)
- Hash Key: ft1.c2
Group Key: ()
+ Hash Key: ft1.c2
-> Foreign Scan on public.ft1
Output: c2, c1
Remote SQL: SELECT "C 1", c2 FROM "S 1"."T 1" WHERE ((c2 < 3))
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 6914d18..7486d4b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2319,7 +2319,7 @@ show_grouping_set_keys(PlanState *planstate,
const char *keyname;
const char *keysetname;
- if (aggnode->aggstrategy == AGG_HASHED || aggnode->aggstrategy == AGG_MIXED)
+ if (aggnode->aggstrategy == AGG_HASHED)
{
keyname = "Hash Key";
keysetname = "Hash Keys";
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 1370ffe..3533f5c 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -80,7 +80,7 @@ static void ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash,
+ int transno, int setno, AggStatePerPhase perphase,
bool nullcheck);
@@ -2931,13 +2931,13 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
* the array of AggStatePerGroup, and skip evaluation if so.
*/
ExprState *
-ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash, bool nullcheck)
+ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase, bool nullcheck, bool allow_concurrent_hashing)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
ExprEvalStep scratch = {0};
bool isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
+ ListCell *lc;
LastAttnumInfo deform = {0, 0, 0};
state->expr = (Expr *) aggstate;
@@ -2978,6 +2978,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
NullableDatum *strictargs = NULL;
bool *strictnulls = NULL;
int argno;
+ int setno;
ListCell *bail;
/*
@@ -3155,37 +3156,27 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* grouping set). Do so for both sort and hash based computations, as
* applicable.
*/
- if (doSort)
+ for (setno = 0; setno < phase->numsets; setno++)
{
- int processGroupingSets = Max(phase->numsets, 1);
- int setoff = 0;
-
- for (int setno = 0; setno < processGroupingSets; setno++)
- {
- ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, false,
- nullcheck);
- setoff++;
- }
+ ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
+ pertrans, transno, setno, phase, nullcheck);
}
- if (doHash)
+ /*
+ * Call transition function for HASHED aggs that can be
+ * advanced concurrently.
+ */
+ if (allow_concurrent_hashing &&
+ phase->concurrent_hashes)
{
- int numHashes = aggstate->num_hashes;
- int setoff;
-
- /* in MIXED mode, there'll be preceding transition values */
- if (aggstate->aggstrategy != AGG_HASHED)
- setoff = aggstate->maxsets;
- else
- setoff = 0;
-
- for (int setno = 0; setno < numHashes; setno++)
+ foreach(lc, phase->concurrent_hashes)
{
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) lfirst(lc);
+
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
- pertrans, transno, setno, setoff, true,
+ pertrans, transno, 0,
+ (AggStatePerPhase) perhash,
nullcheck);
- setoff++;
}
}
@@ -3234,14 +3225,17 @@ static void
ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
- int transno, int setno, int setoff, bool ishash,
+ int transno, int setno, AggStatePerPhase perphase,
bool nullcheck)
{
ExprContext *aggcontext;
int adjust_jumpnull = -1;
- if (ishash)
+ if (perphase->is_hashed)
+ {
+ Assert(setno == 0);
aggcontext = aggstate->hashcontext;
+ }
else
aggcontext = aggstate->aggcontexts[setno];
@@ -3249,9 +3243,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
if (nullcheck)
{
scratch->opcode = EEOP_AGG_PLAIN_PERGROUP_NULLCHECK;
- scratch->d.agg_plain_pergroup_nullcheck.setoff = setoff;
+ scratch->d.agg_plain_pergroup_nullcheck.pergroups = perphase->pergroups;
/* adjust later */
scratch->d.agg_plain_pergroup_nullcheck.jumpnull = -1;
+ scratch->d.agg_plain_pergroup_nullcheck.setno = setno;
ExprEvalPushStep(state, scratch);
adjust_jumpnull = state->steps_len - 1;
}
@@ -3319,7 +3314,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
scratch->d.agg_trans.pertrans = pertrans;
scratch->d.agg_trans.setno = setno;
- scratch->d.agg_trans.setoff = setoff;
+ scratch->d.agg_trans.pergroups = perphase->pergroups;
scratch->d.agg_trans.transno = transno;
scratch->d.agg_trans.aggcontext = aggcontext;
ExprEvalPushStep(state, scratch);
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 113ed15..b0dbba4 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -1610,9 +1610,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_CASE(EEOP_AGG_PLAIN_PERGROUP_NULLCHECK)
{
- AggState *aggstate = castNode(AggState, state->parent);
- AggStatePerGroup pergroup_allaggs = aggstate->all_pergroups
- [op->d.agg_plain_pergroup_nullcheck.setoff];
+ AggStatePerGroup pergroup_allaggs =
+ op->d.agg_plain_pergroup_nullcheck.pergroups
+ [op->d.agg_plain_pergroup_nullcheck.setno];
if (pergroup_allaggs == NULL)
EEO_JUMP(op->d.agg_plain_pergroup_nullcheck.jumpnull);
@@ -1636,8 +1636,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1665,8 +1665,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1684,8 +1684,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(pertrans->transtypeByVal);
@@ -1702,8 +1702,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
@@ -1724,8 +1724,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
@@ -1742,8 +1742,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
- AggStatePerGroup pergroup = &aggstate->all_pergroups
- [op->d.agg_trans.setoff]
+ AggStatePerGroup pergroup = &op->d.agg_trans.pergroups
+ [op->d.agg_trans.setno]
[op->d.agg_trans.transno];
Assert(!pertrans->transtypeByVal);
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index b4d652f..3287ed4 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -250,6 +250,7 @@
#include "miscadmin.h"
#include "nodes/makefuncs.h"
#include "nodes/nodeFuncs.h"
+#include "nodes/pathnodes.h"
#include "optimizer/optimizer.h"
#include "parser/parse_agg.h"
#include "parser/parse_coerce.h"
@@ -348,13 +349,39 @@ typedef struct HashAggSpill
*/
typedef struct HashAggBatch
{
- int setno; /* grouping set */
+ int phaseidx; /* phase that own this batch */
int used_bits; /* number of bits of hash already used */
LogicalTapeSet *tapeset; /* borrowed reference to tape set */
int input_tapenum; /* input partition tape */
int64 input_tuples; /* number of tuples in this batch */
} HashAggBatch;
+/*
+ * Represents the different stages of hash aggregation.
+ *
+ * HASHAGG_INITIAL: initial stage of hash aggregation; all hash transitions
+ * may be done in one expression, input tuples come from the outer node, and
+ * all hash entries are filled.
+ *
+ * HASHAGG_SPILL: hash spill mode has been entered; all hash transitions may
+ * still be done in one expression and input tuples still come from the
+ * outer node, but some hash entries might not be filled, so a null check is
+ * added to the transition expression.
+ *
+ * HASHAGG_REFILL: the hash table is being refilled, one grouping set at a
+ * time; doing all hash transitions in one expression is disallowed, input
+ * tuples come from the spill files, and only one hash entry is filled. We
+ * may re-enter hash spill mode while refilling the hash table, but the
+ * transition expression is not called if the hash entry is not filled, so
+ * no null check is added to the transition expression.
+ */
+typedef enum HashAggStage
+{
+ HASHAGG_INITIAL = 0,
+ HASHAGG_SPILL,
+ HASHAGG_REFILL,
+} HashAggStage;
+
static void select_current_set(AggState *aggstate, int setno, bool is_hash);
static void initialize_phase(AggState *aggstate, int newphase);
static TupleTableSlot *fetch_input_tuple(AggState *aggstate);
@@ -379,7 +406,7 @@ static void finalize_partialaggregate(AggState *aggstate,
AggStatePerAgg peragg,
AggStatePerGroup pergroupstate,
Datum *resultVal, bool *resultIsNull);
-static void prepare_hash_slot(AggState *aggstate);
+static void prepare_hash_slot(AggState *aggstate, AggStatePerPhaseHash perhash);
static void prepare_projection_slot(AggState *aggstate,
TupleTableSlot *slot,
int currentSet);
@@ -390,9 +417,9 @@ static TupleTableSlot *project_aggregates(AggState *aggstate);
static Bitmapset *find_unaggregated_cols(AggState *aggstate);
static bool find_unaggregated_cols_walker(Node *node, Bitmapset **colnos);
static void build_hash_tables(AggState *aggstate);
-static void build_hash_table(AggState *aggstate, int setno, long nbuckets);
-static void hashagg_recompile_expressions(AggState *aggstate, bool minslot,
- bool nullcheck);
+static void build_hash_table(AggState *aggstate,
+ AggStatePerPhaseHash perhash, long nbuckets);
+static void hashagg_recompile_expressions(AggState *aggstate);
static long hash_choose_num_buckets(double hashentrysize,
long estimated_nbuckets,
Size memory);
@@ -400,13 +427,17 @@ static int hash_choose_num_partitions(uint64 input_groups,
double hashentrysize,
int used_bits,
int *log2_npartittions);
-static AggStatePerGroup lookup_hash_entry(AggState *aggstate, uint32 hash,
+static AggStatePerGroup lookup_hash_entry(AggState *aggstate,
+ AggStatePerPhaseHash perhash,
+ uint32 hash,
bool *in_hash_table);
-static void lookup_hash_entries(AggState *aggstate);
+static void lookup_hash_entries(AggState *aggstate,
+ AggStatePerPhaseHash current_hash,
+ List *concurrent_hashes);
static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
-static void agg_sort_input(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static bool agg_refill_hash_table(AggState *aggstate);
+static void agg_sort_input(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
static void hash_agg_check_limits(AggState *aggstate);
@@ -416,7 +447,7 @@ static void hash_agg_update_metrics(AggState *aggstate, bool from_tape,
static void hashagg_finish_initial_spills(AggState *aggstate);
static void hashagg_reset_spill_state(AggState *aggstate);
static HashAggBatch *hashagg_batch_new(LogicalTapeSet *tapeset,
- int input_tapenum, int setno,
+ int input_tapenum, int phaseidx,
int64 input_tuples, int used_bits);
static MinimalTuple hashagg_batch_read(HashAggBatch *batch, uint32 *hashp);
static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
@@ -425,7 +456,7 @@ static void hashagg_spill_init(HashAggSpill *spill, HashTapeInfo *tapeinfo,
static Size hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot,
uint32 hash);
static void hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill,
- int setno);
+ int phaseidx);
static void hashagg_tapeinfo_init(AggState *aggstate);
static void hashagg_tapeinfo_assign(HashTapeInfo *tapeinfo, int *dest,
int ndest);
@@ -459,7 +490,10 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
* ExecAggPlainTransByRef().
*/
if (is_hash)
+ {
+ Assert(setno == 0);
aggstate->curaggcontext = aggstate->hashcontext;
+ }
else
aggstate->curaggcontext = aggstate->aggcontexts[setno];
@@ -467,72 +501,73 @@ select_current_set(AggState *aggstate, int setno, bool is_hash)
}
/*
- * Switch to phase "newphase", which must either be 0 or 1 (to reset) or
+ * Switch to phase "newphase", which must either be 0 (to reset) or
* current_phase + 1. Juggle the tuplesorts accordingly.
- *
- * Phase 0 is for hashing, which we currently handle last in the AGG_MIXED
- * case, so when entering phase 0, all we need to do is drop open sorts.
*/
static void
initialize_phase(AggState *aggstate, int newphase)
{
- Assert(newphase <= 1 || newphase == aggstate->current_phase + 1);
+ AggStatePerPhase current_phase;
+ AggStatePerPhaseSort persort;
+
+ /* Don't use aggstate->phase here, it might not be initialized yet */
+ current_phase = aggstate->phases[aggstate->current_phase];
/*
* Whatever the previous state, we're now done with whatever input
- * tuplesort was in use.
+ * tuplesort was in use; clean it up.
+ *
+ * Note: we keep the first tuplesort/tuplestore; this can benefit a
+ * rescan in some cases by avoiding re-sorting the input.
*/
- if (aggstate->sort_in)
- {
- tuplesort_end(aggstate->sort_in);
- aggstate->sort_in = NULL;
- }
-
- if (newphase <= 1)
+ if (!current_phase->is_hashed && aggstate->current_phase > 0)
{
- /*
- * Discard any existing output tuplesort.
- */
- if (aggstate->sort_out)
+ persort = (AggStatePerPhaseSort) current_phase;
+ if (persort->sort_in)
{
- tuplesort_end(aggstate->sort_out);
- aggstate->sort_out = NULL;
+ tuplesort_end(persort->sort_in);
+ persort->sort_in = NULL;
}
}
- else
- {
- /*
- * The old output tuplesort becomes the new input one, and this is the
- * right time to actually sort it.
- */
- aggstate->sort_in = aggstate->sort_out;
- aggstate->sort_out = NULL;
- Assert(aggstate->sort_in);
- tuplesort_performsort(aggstate->sort_in);
- }
+
+ /* advance to next phase */
+ aggstate->current_phase = newphase;
+ aggstate->phase = aggstate->phases[newphase];
+
+ if (aggstate->phase->is_hashed)
+ return;
+
+ /* New phase is not hashed */
+ persort = (AggStatePerPhaseSort) aggstate->phase;
+
+ /* This is the right time to actually sort it. */
+ if (persort->sort_in)
+ tuplesort_performsort(persort->sort_in);
/*
- * If this isn't the last phase, we need to sort appropriately for the
+ * If copy_out is set, we need to sort appropriately for the
* next phase in sequence.
*/
- if (newphase > 0 && newphase < aggstate->numphases - 1)
+ if (persort->copy_out)
{
- Sort *sortnode = (Sort *) aggstate->phases[newphase + 1].aggnode->sortnode;
- PlanState *outerNode = outerPlanState(aggstate);
- TupleDesc tupDesc = ExecGetResultType(outerNode);
-
- aggstate->sort_out = tuplesort_begin_heap(tupDesc,
- sortnode->numCols,
- sortnode->sortColIdx,
- sortnode->sortOperators,
- sortnode->collations,
- sortnode->nullsFirst,
- work_mem,
- NULL, false);
+ AggStatePerPhaseSort next =
+ (AggStatePerPhaseSort) aggstate->phases[newphase + 1];
+ Sort *sortnode = (Sort *) next->phasedata.aggnode->sortnode;
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+
+ Assert(!next->phasedata.is_hashed);
+
+ if (!next->sort_in)
+ next->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
}
-
- aggstate->current_phase = newphase;
- aggstate->phase = &aggstate->phases[newphase];
}
/*
@@ -547,12 +582,16 @@ static TupleTableSlot *
fetch_input_tuple(AggState *aggstate)
{
TupleTableSlot *slot;
+ AggStatePerPhaseSort current_phase;
- if (aggstate->sort_in)
+ Assert(!aggstate->phase->is_hashed);
+ current_phase = (AggStatePerPhaseSort) aggstate->phase;
+
+ if (current_phase->sort_in)
{
/* make sure we check for interrupts in either path through here */
CHECK_FOR_INTERRUPTS();
- if (!tuplesort_gettupleslot(aggstate->sort_in, true, false,
+ if (!tuplesort_gettupleslot(current_phase->sort_in, true, false,
aggstate->sort_slot, NULL))
return NULL;
slot = aggstate->sort_slot;
@@ -560,8 +599,13 @@ fetch_input_tuple(AggState *aggstate)
else
slot = ExecProcNode(outerPlanState(aggstate));
- if (!TupIsNull(slot) && aggstate->sort_out)
- tuplesort_puttupleslot(aggstate->sort_out, slot);
+ if (!TupIsNull(slot) && current_phase->copy_out)
+ {
+ AggStatePerPhaseSort next =
+ (AggStatePerPhaseSort) aggstate->phases[aggstate->current_phase + 1];
+ Assert(!next->phasedata.is_hashed);
+ tuplesort_puttupleslot(next->sort_in, slot);
+ }
return slot;
}
@@ -667,7 +711,7 @@ initialize_aggregates(AggState *aggstate,
int numReset)
{
int transno;
- int numGroupingSets = Max(aggstate->phase->numsets, 1);
+ int numGroupingSets = aggstate->phase->numsets;
int setno = 0;
int numTrans = aggstate->numtrans;
AggStatePerTrans transstates = aggstate->pertrans;
@@ -1195,10 +1239,9 @@ finalize_partialaggregate(AggState *aggstate,
* hashslot. This is necessary to compute the hash or perform a lookup.
*/
static void
-prepare_hash_slot(AggState *aggstate)
+prepare_hash_slot(AggState *aggstate, AggStatePerPhaseHash perhash)
{
TupleTableSlot *inputslot = aggstate->tmpcontext->ecxt_outertuple;
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
int i;
@@ -1432,29 +1475,33 @@ find_unaggregated_cols_walker(Node *node, Bitmapset **colnos)
static void
build_hash_tables(AggState *aggstate)
{
- int setno;
+ int phaseidx;
- for (setno = 0; setno < aggstate->num_hashes; ++setno)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
long nbuckets;
Size memory;
+ if (!phase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) phase;
+
if (perhash->hashtable != NULL)
{
ResetTupleHashTable(perhash->hashtable);
continue;
}
- Assert(perhash->aggnode->numGroups > 0);
-
memory = aggstate->hash_mem_limit / aggstate->num_hashes;
/* choose reasonable number of buckets per hashtable */
nbuckets = hash_choose_num_buckets(
- aggstate->hashentrysize, perhash->aggnode->numGroups, memory);
+ aggstate->hashentrysize, phase->aggnode->numGroups, memory);
- build_hash_table(aggstate, setno, nbuckets);
+ build_hash_table(aggstate, perhash, nbuckets);
}
aggstate->hash_ngroups_current = 0;
@@ -1464,9 +1511,8 @@ build_hash_tables(AggState *aggstate)
* Build a single hashtable for this grouping set.
*/
static void
-build_hash_table(AggState *aggstate, int setno, long nbuckets)
+build_hash_table(AggState *aggstate, AggStatePerPhaseHash perhash, long nbuckets)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
MemoryContext metacxt = aggstate->hash_metacxt;
MemoryContext hashcxt = aggstate->hashcontext->ecxt_per_tuple_memory;
MemoryContext tmpcxt = aggstate->tmpcontext->ecxt_per_tuple_memory;
@@ -1490,7 +1536,7 @@ build_hash_table(AggState *aggstate, int setno, long nbuckets)
perhash->hashGrpColIdxHash,
perhash->eqfuncoids,
perhash->hashfunctions,
- perhash->aggnode->grpCollations,
+ perhash->phasedata.aggnode->grpCollations,
nbuckets,
additionalsize,
metacxt,
@@ -1529,23 +1575,29 @@ find_hash_columns(AggState *aggstate)
{
Bitmapset *base_colnos;
List *outerTlist = outerPlanState(aggstate)->plan->targetlist;
- int numHashes = aggstate->num_hashes;
EState *estate = aggstate->ss.ps.state;
int j;
/* Find Vars that will be needed in tlist and qual */
base_colnos = find_unaggregated_cols(aggstate);
- for (j = 0; j < numHashes; ++j)
+ for (j = 0; j < aggstate->numphases; ++j)
{
- AggStatePerHash perhash = &aggstate->perhash[j];
+ AggStatePerPhase perphase = aggstate->phases[j];
+ AggStatePerPhaseHash perhash;
Bitmapset *colnos = bms_copy(base_colnos);
- AttrNumber *grpColIdx = perhash->aggnode->grpColIdx;
+ Bitmapset *grouped_cols = perphase->grouped_cols[0];
+ AttrNumber *grpColIdx = perphase->aggnode->grpColIdx;
List *hashTlist = NIL;
+ ListCell *lc;
TupleDesc hashDesc;
int maxCols;
int i;
+ if (!perphase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) perphase;
perhash->largestGrpColIdx = 0;
/*
@@ -1555,18 +1607,12 @@ find_hash_columns(AggState *aggstate)
* there'd be no point storing them. Use prepare_projection_slot's
* logic to determine which.
*/
- if (aggstate->phases[0].grouped_cols)
+ foreach(lc, aggstate->all_grouped_cols)
{
- Bitmapset *grouped_cols = aggstate->phases[0].grouped_cols[j];
- ListCell *lc;
-
- foreach(lc, aggstate->all_grouped_cols)
- {
- int attnum = lfirst_int(lc);
+ int attnum = lfirst_int(lc);
- if (!bms_is_member(attnum, grouped_cols))
- colnos = bms_del_member(colnos, attnum);
- }
+ if (!bms_is_member(attnum, grouped_cols))
+ colnos = bms_del_member(colnos, attnum);
}
/*
@@ -1622,7 +1668,7 @@ find_hash_columns(AggState *aggstate)
hashDesc = ExecTypeFromTL(hashTlist);
execTuplesHashPrepare(perhash->numCols,
- perhash->aggnode->grpOperators,
+ perphase->aggnode->grpOperators,
&perhash->eqfuncoids,
&perhash->hashfunctions);
perhash->hashslot =
@@ -1669,28 +1715,44 @@ hash_agg_entry_size(int numAggs, Size tupleWidth, Size transitionSpace)
* expressions in the AggStatePerPhase, and reuse when appropriate.
*/
static void
-hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
+hashagg_recompile_expressions(AggState *aggstate)
{
- AggStatePerPhase phase;
- int i = minslot ? 1 : 0;
- int j = nullcheck ? 1 : 0;
+ AggStatePerPhase phase = aggstate->phase;
Assert(aggstate->aggstrategy == AGG_HASHED ||
aggstate->aggstrategy == AGG_MIXED);
- if (aggstate->aggstrategy == AGG_HASHED)
- phase = &aggstate->phases[0];
- else /* AGG_MIXED */
- phase = &aggstate->phases[1];
-
- if (phase->evaltrans_cache[i][j] == NULL)
+ if (phase->evaltrans_cache[aggstate->hash_agg_stage] == NULL)
{
const TupleTableSlotOps *outerops = aggstate->ss.ps.outerops;
- bool outerfixed = aggstate->ss.ps.outeropsfixed;
- bool dohash = true;
- bool dosort;
+ bool outerfixed = aggstate->ss.ps.outeropsfixed;
+ bool minslot = false;
+ bool nullcheck = false;
+ bool allow_concurrent_hashing = true;
- dosort = aggstate->aggstrategy == AGG_MIXED ? true : false;
+ /*
+ * We are refilling the hash table, and we disallow concurrent hashing
+ * within the transition expression because we refill the hash tables
+ * one set at a time; this avoids an unnecessary nullcheck. Meanwhile,
+ * we get tuples from a spill file, so they are MinimalTuples.
+ */
+ if (aggstate->hash_agg_stage == HASHAGG_REFILL)
+ {
+ minslot = true;
+ nullcheck = false;
+ allow_concurrent_hashing = false;
+ }
+		/*
+		 * Once we have entered spill mode, concurrent hashing still works,
+		 * but some grouping sets must route their tuples to spill files and
+		 * their pergroup states will be NULL, so a nullcheck is needed.
+		 */
+ else if (aggstate->hash_agg_stage == HASHAGG_SPILL)
+ {
+ minslot = false;
+ nullcheck = true;
+ allow_concurrent_hashing = true;
+ }
/* temporarily change the outerops while compiling the expression */
if (minslot)
@@ -1699,15 +1761,15 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
aggstate->ss.ps.outeropsfixed = true;
}
- phase->evaltrans_cache[i][j] = ExecBuildAggTrans(
- aggstate, phase, dosort, dohash, nullcheck);
+ phase->evaltrans_cache[aggstate->hash_agg_stage] =
+ ExecBuildAggTrans(aggstate, phase, nullcheck, allow_concurrent_hashing);
/* change back */
aggstate->ss.ps.outerops = outerops;
aggstate->ss.ps.outeropsfixed = outerfixed;
}
- phase->evaltrans = phase->evaltrans_cache[i][j];
+ phase->evaltrans = phase->evaltrans_cache[aggstate->hash_agg_stage];
}
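To make the new caching scheme easier to follow: instead of indexing the compiled transition expressions by the (minslot, nullcheck) pair, the patch keys the cache on the hash-agg stage, from which both flags are derived. Below is a minimal standalone sketch of that idea; the names mirror the patch, but the "compile" step is a stand-in for ExecBuildAggTrans(), not the real code.

```c
#include <assert.h>
#include <stddef.h>

typedef enum HashAggStage
{
	HASHAGG_INITIAL = 0,
	HASHAGG_SPILL,
	HASHAGG_REFILL,
	NUM_HASHAGG_STAGES
} HashAggStage;

typedef struct PhaseCache
{
	int			compiled[NUM_HASHAGG_STAGES];	/* 0 = not yet compiled */
	int			ncompiles;		/* how many real compiles happened */
} PhaseCache;

/* Pretend "compilation": derive the flags from the stage, as the patch does. */
static int
compile_for_stage(HashAggStage stage)
{
	int			minslot = (stage == HASHAGG_REFILL);	/* spilled tuples are minimal */
	int			nullcheck = (stage == HASHAGG_SPILL);	/* some pergroups are NULL */

	return 1 + (minslot << 1) + nullcheck;	/* nonzero token stands in for an ExprState */
}

/* Recompile only when this stage has no cached expression yet. */
static int
get_evaltrans(PhaseCache *cache, HashAggStage stage)
{
	if (cache->compiled[stage] == 0)
	{
		cache->compiled[stage] = compile_for_stage(stage);
		cache->ncompiles++;
	}
	return cache->compiled[stage];
}
```

Repeated lookups for the same stage hit the cache, so switching between SPILL and REFILL does not recompile each time.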
/*
@@ -1804,29 +1866,22 @@ static void
hash_agg_enter_spill_mode(AggState *aggstate)
{
aggstate->hash_spill_mode = true;
- hashagg_recompile_expressions(aggstate, aggstate->table_filled, true);
+
+ /* if table_filled is true, we must be refilling the hash table */
+ if (aggstate->table_filled)
+ aggstate->hash_agg_stage = HASHAGG_REFILL;
+ else
+ aggstate->hash_agg_stage = HASHAGG_SPILL;
+
+ hashagg_recompile_expressions(aggstate);
if (!aggstate->hash_ever_spilled)
{
Assert(aggstate->hash_tapeinfo == NULL);
- Assert(aggstate->hash_spills == NULL);
aggstate->hash_ever_spilled = true;
hashagg_tapeinfo_init(aggstate);
-
- aggstate->hash_spills = palloc(
- sizeof(HashAggSpill) * aggstate->num_hashes);
-
- for (int setno = 0; setno < aggstate->num_hashes; setno++)
- {
- AggStatePerHash perhash = &aggstate->perhash[setno];
- HashAggSpill *spill = &aggstate->hash_spills[setno];
-
- hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
- perhash->aggnode->numGroups,
- aggstate->hashentrysize);
- }
}
}
@@ -1974,9 +2029,9 @@ hash_choose_num_partitions(uint64 input_groups, double hashentrysize,
* spill it to disk.
*/
static AggStatePerGroup
-lookup_hash_entry(AggState *aggstate, uint32 hash, bool *in_hash_table)
+lookup_hash_entry(AggState *aggstate, AggStatePerPhaseHash perhash,
+ uint32 hash, bool *in_hash_table)
{
- AggStatePerHash perhash = &aggstate->perhash[aggstate->current_set];
TupleTableSlot *hashslot = perhash->hashslot;
TupleHashEntryData *entry;
bool isnew = false;
@@ -2050,34 +2105,48 @@ lookup_hash_entry(AggState *aggstate, uint32 hash, bool *in_hash_table)
* efficient.
*/
static void
-lookup_hash_entries(AggState *aggstate)
+lookup_hash_entries(AggState *aggstate, AggStatePerPhaseHash current_hash,
+ List *concurrent_hashes)
{
- AggStatePerGroup *pergroup = aggstate->hash_pergroup;
- int setno;
+ AggStatePerPhaseHash perhash;
+ int i;
- for (setno = 0; setno < aggstate->num_hashes; setno++)
+ for (i = 0; i < 1 + list_length(concurrent_hashes); i++)
{
- AggStatePerHash perhash = &aggstate->perhash[setno];
uint32 hash;
bool in_hash_table;
- select_current_set(aggstate, setno, true);
- prepare_hash_slot(aggstate);
+ if (i == 0)
+ perhash = current_hash;
+ else
+ perhash = (AggStatePerPhaseHash) list_nth(concurrent_hashes, i - 1);
+
+ if (!perhash)
+ continue;
+
+ select_current_set(aggstate, 0, true);
+ prepare_hash_slot(aggstate, perhash);
hash = TupleHashTableHash(perhash->hashtable, perhash->hashslot);
- pergroup[setno] = lookup_hash_entry(aggstate, hash, &in_hash_table);
+ perhash->phasedata.pergroups[0] =
+ lookup_hash_entry(aggstate, perhash, hash, &in_hash_table);
/* check to see if we need to spill the tuple for this grouping set */
if (!in_hash_table)
{
- HashAggSpill *spill = &aggstate->hash_spills[setno];
TupleTableSlot *slot = aggstate->tmpcontext->ecxt_outertuple;
- if (spill->partitions == NULL)
- hashagg_spill_init(spill, aggstate->hash_tapeinfo, 0,
- perhash->aggnode->numGroups,
+ if (perhash->hash_spill == NULL)
+ perhash->hash_spill = palloc0(sizeof(HashAggSpill));
+
+ if (perhash->hash_spill->partitions == NULL)
+ hashagg_spill_init(perhash->hash_spill,
+ aggstate->hash_tapeinfo, 0,
+ perhash->phasedata.aggnode->numGroups,
aggstate->hashentrysize);
- hashagg_spill_tuple(spill, slot, hash);
+ hashagg_spill_tuple(perhash->hash_spill,
+ slot,
+ hash);
}
}
}
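The spill handling in lookup_hash_entries() above is now lazy per phase: the HashAggSpill is only allocated the first time a tuple for that grouping set fails to fit in the hash table. A toy sketch of that pattern (illustrative names, not the patch's structs):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Spill
{
	int			npartitions;
	long		ntuples;
} Spill;

typedef struct Hash
{
	Spill	   *spill;			/* NULL until the first spilled tuple */
} Hash;

static void
spill_tuple(Hash *perhash, int default_partitions)
{
	if (perhash->spill == NULL)
	{
		/* stands in for palloc0() + hashagg_spill_init() */
		perhash->spill = calloc(1, sizeof(Spill));
		perhash->spill->npartitions = default_partitions;
	}
	/* stands in for hashagg_spill_tuple() writing the tuple out */
	perhash->spill->ntuples++;
}
```

Grouping sets that never overflow their hash table never pay for a spill structure.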
@@ -2111,12 +2180,11 @@ ExecAgg(PlanState *pstate)
case AGG_HASHED:
if (!node->table_filled)
agg_fill_hash_table(node);
- /* FALLTHROUGH */
- case AGG_MIXED:
result = agg_retrieve_hash_table(node);
break;
case AGG_PLAIN:
case AGG_SORTED:
+ case AGG_MIXED:
if (!node->input_sorted)
agg_sort_input(node);
result = agg_retrieve_direct(node);
@@ -2144,8 +2212,8 @@ agg_retrieve_direct(AggState *aggstate)
TupleTableSlot *outerslot;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- bool hasGroupingSets = aggstate->phase->numsets > 0;
- int numGroupingSets = Max(aggstate->phase->numsets, 1);
+ bool hasGroupingSets = aggstate->phase->aggnode->groupingSets != NULL;
+ int numGroupingSets = aggstate->phase->numsets;
int currentSet;
int nextSetSize;
int numReset;
@@ -2162,7 +2230,7 @@ agg_retrieve_direct(AggState *aggstate)
tmpcontext = aggstate->tmpcontext;
peragg = aggstate->peragg;
- pergroups = aggstate->pergroups;
+ pergroups = aggstate->phase->pergroups;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
/*
@@ -2220,25 +2288,35 @@ agg_retrieve_direct(AggState *aggstate)
{
if (aggstate->current_phase < aggstate->numphases - 1)
{
+ /* Advance to the next phase */
initialize_phase(aggstate, aggstate->current_phase + 1);
- aggstate->input_done = false;
- aggstate->projected_set = -1;
- numGroupingSets = Max(aggstate->phase->numsets, 1);
- node = aggstate->phase->aggnode;
- numReset = numGroupingSets;
- }
- else if (aggstate->aggstrategy == AGG_MIXED)
- {
- /*
- * Mixed mode; we've output all the grouped stuff and have
- * full hashtables, so switch to outputting those.
- */
- initialize_phase(aggstate, 0);
- aggstate->table_filled = true;
- ResetTupleHashIterator(aggstate->perhash[0].hashtable,
- &aggstate->perhash[0].hashiter);
- select_current_set(aggstate, 0, true);
- return agg_retrieve_hash_table(aggstate);
+
+ /* Check whether new phase is an AGG_HASHED */
+ if (!aggstate->phase->is_hashed)
+ {
+ aggstate->input_done = false;
+ aggstate->projected_set = -1;
+ numGroupingSets = aggstate->phase->numsets;
+ node = aggstate->phase->aggnode;
+ numReset = numGroupingSets;
+ pergroups = aggstate->phase->pergroups;
+ }
+ else
+ {
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) aggstate->phase;
+ /* finalize any spills */
+ hashagg_finish_initial_spills(aggstate);
+
+ /*
+ * Mixed mode; we've output all the grouped stuff and have
+ * full hashtables, so switch to outputting those.
+ */
+ aggstate->table_filled = true;
+ ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
+ select_current_set(aggstate, 0, true);
+ return agg_retrieve_hash_table(aggstate);
+ }
}
else
{
@@ -2277,11 +2355,11 @@ agg_retrieve_direct(AggState *aggstate)
*/
tmpcontext->ecxt_innertuple = econtext->ecxt_outertuple;
if (aggstate->input_done ||
- (node->aggstrategy != AGG_PLAIN &&
+ (aggstate->phase->aggnode->numCols > 0 &&
aggstate->projected_set != -1 &&
aggstate->projected_set < (numGroupingSets - 1) &&
nextSetSize > 0 &&
- !ExecQualAndReset(aggstate->phase->eqfunctions[nextSetSize - 1],
+ !ExecQualAndReset(((AggStatePerPhaseSort) aggstate->phase)->eqfunctions[nextSetSize - 1],
tmpcontext)))
{
aggstate->projected_set += 1;
@@ -2384,13 +2462,13 @@ agg_retrieve_direct(AggState *aggstate)
for (;;)
{
/*
- * During phase 1 only of a mixed agg, we need to update
- * hashtables as well in advance_aggregates.
+ * If current phase can do transition concurrently, we need
+ * to update hashtables as well in advance_aggregates.
*/
- if (aggstate->aggstrategy == AGG_MIXED &&
- aggstate->current_phase == 1)
+ if (aggstate->phase->concurrent_hashes)
{
- lookup_hash_entries(aggstate);
+ lookup_hash_entries(aggstate, NULL,
+ aggstate->phase->concurrent_hashes);
}
/* Advance the aggregates (or combine functions) */
@@ -2404,11 +2482,6 @@ agg_retrieve_direct(AggState *aggstate)
{
/* no more outer-plan tuples available */
- /* if we built hash tables, finalize any spills */
- if (aggstate->aggstrategy == AGG_MIXED &&
- aggstate->current_phase == 1)
- hashagg_finish_initial_spills(aggstate);
-
if (hasGroupingSets)
{
aggstate->input_done = true;
@@ -2427,10 +2500,10 @@ agg_retrieve_direct(AggState *aggstate)
* If we are grouping, check whether we've crossed a group
* boundary.
*/
- if (node->aggstrategy != AGG_PLAIN)
+ if (aggstate->phase->aggnode->numCols > 0)
{
tmpcontext->ecxt_innertuple = firstSlot;
- if (!ExecQual(aggstate->phase->eqfunctions[node->numCols - 1],
+ if (!ExecQual(((AggStatePerPhaseSort) aggstate->phase)->eqfunctions[node->numCols - 1],
tmpcontext))
{
aggstate->grp_firstTuple = ExecCopySlotHeapTuple(outerslot);
@@ -2479,24 +2552,31 @@ agg_retrieve_direct(AggState *aggstate)
static void
agg_sort_input(AggState *aggstate)
{
- AggStatePerPhase phase = &aggstate->phases[1];
+ AggStatePerPhase phase = aggstate->phases[0];
+ AggStatePerPhaseSort persort = (AggStatePerPhaseSort) phase;
TupleDesc tupDesc;
Sort *sortnode;
+ bool randomAccess;
Assert(!aggstate->input_sorted);
+ Assert(!phase->is_hashed);
Assert(phase->aggnode->sortnode);
sortnode = (Sort *) phase->aggnode->sortnode;
tupDesc = ExecGetResultType(outerPlanState(aggstate));
-
- aggstate->sort_in = tuplesort_begin_heap(tupDesc,
- sortnode->numCols,
- sortnode->sortColIdx,
- sortnode->sortOperators,
- sortnode->collations,
- sortnode->nullsFirst,
- work_mem,
- NULL, false);
+ randomAccess = (aggstate->eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) != 0;
+
+ persort->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, randomAccess);
for (;;)
{
TupleTableSlot *outerslot;
@@ -2505,11 +2585,11 @@ agg_sort_input(AggState *aggstate)
if (TupIsNull(outerslot))
break;
- tuplesort_puttupleslot(aggstate->sort_in, outerslot);
+ tuplesort_puttupleslot(persort->sort_in, outerslot);
}
/* Sort the first phase */
- tuplesort_performsort(aggstate->sort_in);
+ tuplesort_performsort(persort->sort_in);
/* Mark the input to be sorted */
aggstate->input_sorted = true;
@@ -2521,8 +2601,14 @@ agg_sort_input(AggState *aggstate)
static void
agg_fill_hash_table(AggState *aggstate)
{
+ AggStatePerPhaseHash current_hash;
TupleTableSlot *outerslot;
ExprContext *tmpcontext = aggstate->tmpcontext;
+ List *concurrent_hashes = aggstate->phase->concurrent_hashes;
+
+ /* Current phase must be the first phase */
+ Assert(aggstate->current_phase == 0);
+ current_hash = (AggStatePerPhaseHash) aggstate->phase;
/*
* Process each outer-plan tuple, and then fetch the next one, until we
@@ -2530,7 +2616,7 @@ agg_fill_hash_table(AggState *aggstate)
*/
for (;;)
{
- outerslot = fetch_input_tuple(aggstate);
+ outerslot = ExecProcNode(outerPlanState(aggstate));
if (TupIsNull(outerslot))
break;
@@ -2538,7 +2624,7 @@ agg_fill_hash_table(AggState *aggstate)
tmpcontext->ecxt_outertuple = outerslot;
/* Find or build hashtable entries */
- lookup_hash_entries(aggstate);
+ lookup_hash_entries(aggstate, current_hash, concurrent_hashes);
/* Advance the aggregates (or combine functions) */
advance_aggregates(aggstate);
@@ -2556,8 +2642,7 @@ agg_fill_hash_table(AggState *aggstate)
aggstate->table_filled = true;
/* Initialize to walk the first hash table */
select_current_set(aggstate, 0, true);
- ResetTupleHashIterator(aggstate->perhash[0].hashtable,
- &aggstate->perhash[0].hashiter);
+	ResetTupleHashIterator(current_hash->hashtable, &current_hash->hashiter);
}
/*
@@ -2575,6 +2660,7 @@ agg_fill_hash_table(AggState *aggstate)
static bool
agg_refill_hash_table(AggState *aggstate)
{
+ AggStatePerPhaseHash perhash;
HashAggBatch *batch;
HashAggSpill spill;
HashTapeInfo *tapeinfo = aggstate->hash_tapeinfo;
@@ -2586,6 +2672,7 @@ agg_refill_hash_table(AggState *aggstate)
batch = linitial(aggstate->hash_batches);
aggstate->hash_batches = list_delete_first(aggstate->hash_batches);
+ perhash = (AggStatePerPhaseHash) aggstate->phases[batch->phaseidx];
/*
* Estimate the number of groups for this batch as the total number of
@@ -2600,32 +2687,15 @@ agg_refill_hash_table(AggState *aggstate)
batch->used_bits, &aggstate->hash_mem_limit,
&aggstate->hash_ngroups_limit, NULL);
- /* there could be residual pergroup pointers; clear them */
- for (int setoff = 0;
- setoff < aggstate->maxsets + aggstate->num_hashes;
- setoff++)
- aggstate->all_pergroups[setoff] = NULL;
-
/* free memory and reset hash tables */
ReScanExprContext(aggstate->hashcontext);
- for (int setno = 0; setno < aggstate->num_hashes; setno++)
- ResetTupleHashTable(aggstate->perhash[setno].hashtable);
+ ResetTupleHashTable(perhash->hashtable);
aggstate->hash_ngroups_current = 0;
- /*
- * In AGG_MIXED mode, hash aggregation happens in phase 1 and the output
- * happens in phase 0. So, we switch to phase 1 when processing a batch,
- * and back to phase 0 after the batch is done.
- */
- Assert(aggstate->current_phase == 0);
- if (aggstate->phase->aggstrategy == AGG_MIXED)
- {
- aggstate->current_phase = 1;
- aggstate->phase = &aggstate->phases[aggstate->current_phase];
- }
-
- select_current_set(aggstate, batch->setno, true);
+ /* switch to the phase of current batch */
+ initialize_phase(aggstate, batch->phaseidx);
+ select_current_set(aggstate, 0, true);
/*
* Spilled tuples are always read back as MinimalTuples, which may be
@@ -2634,7 +2704,8 @@ agg_refill_hash_table(AggState *aggstate)
* We still need the NULL check, because we are only processing one
* grouping set at a time and the rest will be NULL.
*/
- hashagg_recompile_expressions(aggstate, true, true);
+ aggstate->hash_agg_stage = HASHAGG_REFILL;
+ hashagg_recompile_expressions(aggstate);
LogicalTapeRewindForRead(tapeinfo->tapeset, batch->input_tapenum,
HASHAGG_READ_BUFFER_SIZE);
@@ -2653,9 +2724,9 @@ agg_refill_hash_table(AggState *aggstate)
ExecStoreMinimalTuple(tuple, slot, true);
aggstate->tmpcontext->ecxt_outertuple = slot;
- prepare_hash_slot(aggstate);
- aggstate->hash_pergroup[batch->setno] = lookup_hash_entry(
- aggstate, hash, &in_hash_table);
+ prepare_hash_slot(aggstate, perhash);
+ perhash->phasedata.pergroups[0] =
+ lookup_hash_entry(aggstate, perhash, hash, &in_hash_table);
if (in_hash_table)
{
@@ -2687,14 +2758,10 @@ agg_refill_hash_table(AggState *aggstate)
hashagg_tapeinfo_release(tapeinfo, batch->input_tapenum);
- /* change back to phase 0 */
- aggstate->current_phase = 0;
- aggstate->phase = &aggstate->phases[aggstate->current_phase];
-
if (spill_initialized)
{
hash_agg_update_metrics(aggstate, true, spill.npartitions);
- hashagg_spill_finish(aggstate, &spill, batch->setno);
+ hashagg_spill_finish(aggstate, &spill, batch->phaseidx);
}
else
hash_agg_update_metrics(aggstate, true, 0);
@@ -2702,9 +2769,7 @@ agg_refill_hash_table(AggState *aggstate)
aggstate->hash_spill_mode = false;
/* prepare to walk the first hash table */
- select_current_set(aggstate, batch->setno, true);
- ResetTupleHashIterator(aggstate->perhash[batch->setno].hashtable,
- &aggstate->perhash[batch->setno].hashiter);
+ ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
pfree(batch);
@@ -2752,7 +2817,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
TupleHashEntryData *entry;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- AggStatePerHash perhash;
+ AggStatePerPhaseHash perhash;
/*
* get state info from node.
@@ -2763,11 +2828,7 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
peragg = aggstate->peragg;
firstSlot = aggstate->ss.ss_ScanTupleSlot;
- /*
- * Note that perhash (and therefore anything accessed through it) can
- * change inside the loop, as we change between grouping sets.
- */
- perhash = &aggstate->perhash[aggstate->current_set];
+ perhash = (AggStatePerPhaseHash) aggstate->phase;
/*
* We loop retrieving groups until we find one satisfying
@@ -2786,18 +2847,16 @@ agg_retrieve_hash_table_in_memory(AggState *aggstate)
entry = ScanTupleHashTable(perhash->hashtable, &perhash->hashiter);
if (entry == NULL)
{
- int nextset = aggstate->current_set + 1;
-
- if (nextset < aggstate->num_hashes)
+ if (aggstate->current_phase + 1 < aggstate->numphases &&
+ aggstate->hash_agg_stage != HASHAGG_REFILL)
{
/*
* Switch to next grouping set, reinitialize, and restart the
* loop.
*/
- select_current_set(aggstate, nextset, true);
-
- perhash = &aggstate->perhash[aggstate->current_set];
-
+ select_current_set(aggstate, 0, true);
+ initialize_phase(aggstate, aggstate->current_phase + 1);
+ perhash = (AggStatePerPhaseHash) aggstate->phase;
ResetTupleHashIterator(perhash->hashtable, &perhash->hashiter);
continue;
@@ -2992,12 +3051,12 @@ hashagg_spill_tuple(HashAggSpill *spill, TupleTableSlot *slot, uint32 hash)
* be done.
*/
static HashAggBatch *
-hashagg_batch_new(LogicalTapeSet *tapeset, int tapenum, int setno,
+hashagg_batch_new(LogicalTapeSet *tapeset, int tapenum, int phaseidx,
int64 input_tuples, int used_bits)
{
HashAggBatch *batch = palloc0(sizeof(HashAggBatch));
- batch->setno = setno;
+ batch->phaseidx = phaseidx;
batch->used_bits = used_bits;
batch->tapeset = tapeset;
batch->input_tapenum = tapenum;
@@ -3063,25 +3122,31 @@ hashagg_batch_read(HashAggBatch *batch, uint32 *hashp)
static void
hashagg_finish_initial_spills(AggState *aggstate)
{
- int setno;
+ int phaseidx;
int total_npartitions = 0;
- if (aggstate->hash_spills != NULL)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- for (setno = 0; setno < aggstate->num_hashes; setno++)
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
+
+ if (!phase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) phase;
+ if (perhash->hash_spill)
{
- HashAggSpill *spill = &aggstate->hash_spills[setno];
- total_npartitions += spill->npartitions;
- hashagg_spill_finish(aggstate, spill, setno);
- }
+ total_npartitions += perhash->hash_spill->npartitions;
+ hashagg_spill_finish(aggstate, perhash->hash_spill, phase->phaseidx);
- /*
- * We're not processing tuples from outer plan any more; only
- * processing batches of spilled tuples. The initial spill structures
- * are no longer needed.
- */
- pfree(aggstate->hash_spills);
- aggstate->hash_spills = NULL;
+ /*
+ * We're not processing tuples from outer plan any more; only
+ * processing batches of spilled tuples. The initial spill structures
+ * are no longer needed.
+ */
+ pfree(perhash->hash_spill);
+ perhash->hash_spill = NULL;
+ }
}
hash_agg_update_metrics(aggstate, false, total_npartitions);
@@ -3094,7 +3159,7 @@ hashagg_finish_initial_spills(AggState *aggstate)
* Transform spill partitions into new batches.
*/
static void
-hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
+hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int phaseidx)
{
int i;
int used_bits = 32 - spill->shift;
@@ -3112,7 +3177,7 @@ hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
continue;
new_batch = hashagg_batch_new(aggstate->hash_tapeinfo->tapeset,
- tapenum, setno, spill->ntuples[i],
+ tapenum, phaseidx, spill->ntuples[i],
used_bits);
aggstate->hash_batches = lcons(new_batch, aggstate->hash_batches);
aggstate->hash_batches_used++;
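For readers unfamiliar with the spill machinery the batches above feed into: partitions are carved out of the high bits of the 32-bit group hash, and `used_bits` records how many bits earlier passes have already consumed, so a recursive re-spill keeps splitting groups on fresh bits. A hedged sketch of that bit arithmetic (not the patch's exact code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick a spill partition from the high bits of the hash.  used_bits high
 * bits were consumed by earlier spill passes; the next npartition_bits
 * below them select the partition for this pass.
 */
static uint32_t
spill_partition(uint32_t hash, int used_bits, int npartition_bits)
{
	int			shift = 32 - used_bits - npartition_bits;

	return (hash >> shift) & ((1u << npartition_bits) - 1);
}
```

Because each pass consumes distinct bits, tuples that collide in one pass can still fan out across partitions in the next.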
@@ -3128,21 +3193,25 @@ hashagg_spill_finish(AggState *aggstate, HashAggSpill *spill, int setno)
static void
hashagg_reset_spill_state(AggState *aggstate)
{
- ListCell *lc;
+ ListCell *lc;
+ int phaseidx;
/* free spills from initial pass */
- if (aggstate->hash_spills != NULL)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- int setno;
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
+
+ if (!phase->is_hashed)
+ continue;
+
+ perhash = (AggStatePerPhaseHash) phase;
- for (setno = 0; setno < aggstate->num_hashes; setno++)
+ if (perhash->hash_spill)
{
- HashAggSpill *spill = &aggstate->hash_spills[setno];
- pfree(spill->ntuples);
- pfree(spill->partitions);
+ pfree(perhash->hash_spill);
+ perhash->hash_spill = NULL;
}
- pfree(aggstate->hash_spills);
- aggstate->hash_spills = NULL;
}
/* free batches */
@@ -3181,25 +3250,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
AggState *aggstate;
AggStatePerAgg peraggs;
AggStatePerTrans pertransstates;
- AggStatePerGroup *pergroups;
Plan *outerPlan;
ExprContext *econtext;
TupleDesc scanDesc;
- Agg *firstSortAgg;
int numaggs,
transno,
aggno;
- int phase;
int phaseidx;
ListCell *l;
Bitmapset *all_grouped_cols = NULL;
int numGroupingSets = 1;
- int numPhases;
- int numHashes;
int i = 0;
int j = 0;
+ bool need_extra_slot = false;
bool use_hashing = (node->aggstrategy == AGG_HASHED ||
node->aggstrategy == AGG_MIXED);
+ uint64 totalHashGroups = 0;
/* check for unsupported flags */
Assert(!(eflags & (EXEC_FLAG_BACKWARD | EXEC_FLAG_MARK)));
@@ -3226,24 +3292,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->curpertrans = NULL;
aggstate->input_done = false;
aggstate->agg_done = false;
- aggstate->pergroups = NULL;
aggstate->grp_firstTuple = NULL;
- aggstate->sort_in = NULL;
- aggstate->sort_out = NULL;
aggstate->input_sorted = true;
-
- /*
- * phases[0] always exists, but is dummy in sorted/plain mode
- */
- numPhases = (use_hashing ? 1 : 2);
- numHashes = (use_hashing ? 1 : 0);
-
- firstSortAgg = node->aggstrategy == AGG_SORTED ? node : NULL;
+ aggstate->eflags = eflags;
+ aggstate->num_hashes = 0;
+ aggstate->hash_agg_stage = HASHAGG_INITIAL;
/*
* Calculate the maximum number of grouping sets in any phase; this
- * determines the size of some allocations. Also calculate the number of
- * phases, since all hashed/mixed nodes contribute to only a single phase.
+ * determines the size of some allocations.
*/
if (node->groupingSets)
{
@@ -3256,31 +3313,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
numGroupingSets = Max(numGroupingSets,
list_length(agg->groupingSets));
- /*
- * additional AGG_HASHED aggs become part of phase 0, but all
- * others add an extra phase.
- */
if (agg->aggstrategy != AGG_HASHED)
- {
- ++numPhases;
-
- if (!firstSortAgg)
- firstSortAgg = agg;
-
- }
- else
- ++numHashes;
+ need_extra_slot = true;
}
}
aggstate->maxsets = numGroupingSets;
- aggstate->numphases = numPhases;
+ aggstate->numphases = 1 + list_length(node->chain);
/*
- * The first SORTED phase is not sorted, agg need to do its own sort. See
+	 * The first phase is not pre-sorted; the Agg node must do its own sort.  See
* agg_sort_input(), this can only happen in groupingsets case.
*/
- if (firstSortAgg && firstSortAgg->sortnode)
+ if (node->sortnode)
aggstate->input_sorted = false;
aggstate->aggcontexts = (ExprContext **)
@@ -3341,11 +3386,10 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
scanDesc = aggstate->ss.ss_ScanTupleSlot->tts_tupleDescriptor;
/*
- * If there are more than two phases (including a potential dummy phase
- * 0), input will be resorted using tuplesort. Need a slot for that.
+	 * An extra slot is needed if 1) the Agg node must do its own sort, or
+	 * 2) there is more than one non-hashed phase.
*/
- if (numPhases > 2 ||
- !aggstate->input_sorted)
+ if (node->sortnode || need_extra_slot)
{
aggstate->sort_slot = ExecInitExtraTupleSlot(estate, scanDesc,
&TTSOpsMinimalTuple);
@@ -3401,72 +3445,92 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* For each phase, prepare grouping set data and fmgr lookup data for
* compare functions. Accumulate all_grouped_cols in passing.
*/
- aggstate->phases = palloc0(numPhases * sizeof(AggStatePerPhaseData));
-
- aggstate->num_hashes = numHashes;
- if (numHashes)
- {
- aggstate->perhash = palloc0(sizeof(AggStatePerHashData) * numHashes);
- aggstate->phases[0].numsets = 0;
- aggstate->phases[0].gset_lengths = palloc(numHashes * sizeof(int));
- aggstate->phases[0].grouped_cols = palloc(numHashes * sizeof(Bitmapset *));
- }
+ aggstate->phases = palloc0(aggstate->numphases * sizeof(AggStatePerPhase));
- phase = 0;
- for (phaseidx = 0; phaseidx <= list_length(node->chain); ++phaseidx)
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
Agg *aggnode;
+ AggStatePerPhase phasedata = NULL;
if (phaseidx > 0)
aggnode = list_nth_node(Agg, node->chain, phaseidx - 1);
else
aggnode = node;
- if (aggnode->aggstrategy == AGG_HASHED
- || aggnode->aggstrategy == AGG_MIXED)
+ if (aggnode->aggstrategy == AGG_HASHED)
{
- AggStatePerPhase phasedata = &aggstate->phases[0];
- AggStatePerHash perhash;
- Bitmapset *cols = NULL;
-
- Assert(phase == 0);
- i = phasedata->numsets++;
- perhash = &aggstate->perhash[i];
+ AggStatePerPhaseHash perhash;
+ Bitmapset *cols = NULL;
- /* phase 0 always points to the "real" Agg in the hash case */
- phasedata->aggnode = node;
- phasedata->aggstrategy = node->aggstrategy;
+ aggstate->num_hashes++;
+ totalHashGroups += aggnode->numGroups;
- /* but the actual Agg node representing this hash is saved here */
- perhash->aggnode = aggnode;
+ perhash = (AggStatePerPhaseHash) palloc0(sizeof(AggStatePerPhaseHashData));
+ phasedata = (AggStatePerPhase) perhash;
+ phasedata->is_hashed = true;
+ phasedata->aggnode = aggnode;
+ phasedata->aggstrategy = aggnode->aggstrategy;
- phasedata->gset_lengths[i] = perhash->numCols = aggnode->numCols;
+ /* AGG_HASHED always has only one set */
+ phasedata->numsets = 1;
+ phasedata->gset_lengths = palloc(sizeof(int));
+ phasedata->gset_lengths[0] = perhash->numCols = aggnode->numCols;
+ phasedata->grouped_cols = palloc(sizeof(Bitmapset *));
for (j = 0; j < aggnode->numCols; ++j)
cols = bms_add_member(cols, aggnode->grpColIdx[j]);
-
- phasedata->grouped_cols[i] = cols;
+ phasedata->grouped_cols[0] = cols;
all_grouped_cols = bms_add_members(all_grouped_cols, cols);
- continue;
+
+			/*
+			 * Initialize the pergroup state.  For AGG_HASHED, transitions
+			 * happen on the fly and all pergroup states are kept in the
+			 * hash table; each time a tuple is processed,
+			 * lookup_hash_entry() chooses one group and sets
+			 * phasedata->pergroups[0], which advance_aggregates() then uses
+			 * to advance that group.  We do not allocate a real pergroup
+			 * state here: there are too many of them, so
+			 * lookup_hash_entry() allocates them on demand.
+			 */
+ phasedata->pergroups =
+ (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup));
+
+			/*
+			 * Hash aggregation does not depend on the order of the input
+			 * tuples, so the transition can be done as soon as a tuple is
+			 * fetched, i.e. concurrently with the first phase.
+			 */
+ if (phaseidx > 0)
+ {
+ aggstate->phases[0]->concurrent_hashes =
+ lappend(aggstate->phases[0]->concurrent_hashes, perhash);
+ /* skip evaltrans for this phase */
+ phasedata->skip_evaltrans = true;
+ }
}
else
{
- AggStatePerPhase phasedata = &aggstate->phases[++phase];
- int num_sets;
+ AggStatePerPhaseSort persort;
- phasedata->numsets = num_sets = list_length(aggnode->groupingSets);
+ persort = (AggStatePerPhaseSort) palloc0(sizeof(AggStatePerPhaseSortData));
+ phasedata = (AggStatePerPhase) persort;
+ phasedata->is_hashed = false;
+ phasedata->aggnode = aggnode;
+ phasedata->aggstrategy = aggnode->aggstrategy;
- if (num_sets)
+ if (aggnode->groupingSets)
{
- phasedata->gset_lengths = palloc(num_sets * sizeof(int));
- phasedata->grouped_cols = palloc(num_sets * sizeof(Bitmapset *));
+ phasedata->numsets = list_length(aggnode->groupingSets);
+ phasedata->gset_lengths = palloc(phasedata->numsets * sizeof(int));
+ phasedata->grouped_cols = palloc(phasedata->numsets * sizeof(Bitmapset *));
i = 0;
foreach(l, aggnode->groupingSets)
{
- int current_length = list_length(lfirst(l));
- Bitmapset *cols = NULL;
+ int current_length = list_length(lfirst(l));
+ Bitmapset *cols = NULL;
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -3483,37 +3547,49 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
else
{
- Assert(phaseidx == 0);
-
+ phasedata->numsets = 1;
phasedata->gset_lengths = NULL;
phasedata->grouped_cols = NULL;
}
/*
+			 * Initialize pergroup states for AGG_SORTED/AGG_PLAIN/AGG_MIXED
+			 * phases.  Each set has only one group in flight, so all groups
+			 * in a set can reuse one pergroup state.  Unlike AGG_HASHED, we
+			 * pre-allocate the pergroup states here.
+			 */
+ phasedata->pergroups =
+ (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup) * phasedata->numsets);
+
+ for (i = 0; i < phasedata->numsets; i++)
+ {
+ phasedata->pergroups[i] =
+ (AggStatePerGroup) palloc0(sizeof(AggStatePerGroupData) * numaggs);
+ }
+
+ /*
* If we are grouping, precompute fmgr lookup data for inner loop.
*/
- if (aggnode->aggstrategy == AGG_SORTED)
+ if (aggnode->numCols > 0)
{
int i = 0;
- Assert(aggnode->numCols > 0);
-
/*
* Build a separate function for each subset of columns that
* need to be compared.
*/
- phasedata->eqfunctions =
+ persort->eqfunctions =
(ExprState **) palloc0(aggnode->numCols * sizeof(ExprState *));
/* for each grouping set */
- for (i = 0; i < phasedata->numsets; i++)
+ for (i = 0; i < phasedata->numsets && phasedata->gset_lengths; i++)
{
int length = phasedata->gset_lengths[i];
- if (phasedata->eqfunctions[length - 1] != NULL)
+ if (persort->eqfunctions[length - 1] != NULL)
continue;
- phasedata->eqfunctions[length - 1] =
+ persort->eqfunctions[length - 1] =
execTuplesMatchPrepare(scanDesc,
length,
aggnode->grpColIdx,
@@ -3523,9 +3599,9 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
/* and for all grouped columns, unless already computed */
- if (phasedata->eqfunctions[aggnode->numCols - 1] == NULL)
+ if (persort->eqfunctions[aggnode->numCols - 1] == NULL)
{
- phasedata->eqfunctions[aggnode->numCols - 1] =
+ persort->eqfunctions[aggnode->numCols - 1] =
execTuplesMatchPrepare(scanDesc,
aggnode->numCols,
aggnode->grpColIdx,
@@ -3535,9 +3611,24 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
}
- phasedata->aggnode = aggnode;
- phasedata->aggstrategy = aggnode->aggstrategy;
+			/*
+			 * A non-first AGG_SORTED phase processes the same input tuples
+			 * as the previous phase, except that it needs to re-sort them.
+			 * Tell the previous phase to copy the tuples out.
+			 */
+ if (phaseidx > 0)
+ {
+ AggStatePerPhaseSort prev =
+ (AggStatePerPhaseSort) aggstate->phases[phaseidx - 1];
+
+ Assert(!prev->phasedata.is_hashed);
+ /* Tell the previous phase to copy the tuple to the sort_in */
+ prev->copy_out = true;
+ }
}
+
+ phasedata->phaseidx = phaseidx;
+ aggstate->phases[phaseidx] = phasedata;
}
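The per-phase structs built in the loop above use the usual C "inheritance" layout: a common header embedded as the first member, so an AggStatePerPhaseHash or AggStatePerPhaseSort pointer can be cast to the base AggStatePerPhase type. A trimmed-down sketch of that layout (field names mimic the patch, but the structs are reduced to the essentials):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct PhaseData
{
	bool		is_hashed;
	int			phaseidx;
} PhaseData;

typedef struct PhaseHash
{
	PhaseData	phasedata;		/* common header, must be first */
	int			numCols;
} PhaseHash;

typedef struct PhaseSort
{
	PhaseData	phasedata;		/* common header, must be first */
	bool		copy_out;
} PhaseSort;

/* Allocate the right subtype but hand back a base pointer, as the patch does. */
static PhaseData *
make_phase(bool is_hashed, int idx)
{
	PhaseData  *p;

	if (is_hashed)
		p = (PhaseData *) calloc(1, sizeof(PhaseHash));
	else
		p = (PhaseData *) calloc(1, sizeof(PhaseSort));

	p->is_hashed = is_hashed;
	p->phaseidx = idx;
	return p;
}
```

Because the header is the first member, the casts in both directions are well defined, and generic code such as initialize_phase() can work with the base pointer while hash- or sort-specific code downcasts after checking is_hashed.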
/*
@@ -3561,51 +3652,9 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->peragg = peraggs;
aggstate->pertrans = pertransstates;
-
- aggstate->all_pergroups =
- (AggStatePerGroup *) palloc0(sizeof(AggStatePerGroup)
- * (numGroupingSets + numHashes));
- pergroups = aggstate->all_pergroups;
-
- if (node->aggstrategy != AGG_HASHED)
- {
- for (i = 0; i < numGroupingSets; i++)
- {
- pergroups[i] = (AggStatePerGroup) palloc0(sizeof(AggStatePerGroupData)
- * numaggs);
- }
-
- aggstate->pergroups = pergroups;
- pergroups += numGroupingSets;
- }
-
- /*
- * Hashing can only appear in the initial phase.
- */
- if (use_hashing)
- {
- /* this is an array of pointers, not structures */
- aggstate->hash_pergroup = pergroups;
- }
-
- /*
- * Initialize current phase-dependent values to initial phase. The initial
- * phase is 1 (first sort pass) for all strategies that use sorting (if
- * hashing is being done too, then phase 0 is processed last); but if only
- * hashing is being done, then phase 0 is all there is.
- */
- if (node->aggstrategy == AGG_HASHED)
- {
- aggstate->current_phase = 0;
- initialize_phase(aggstate, 0);
- select_current_set(aggstate, 0, true);
- }
- else
- {
- aggstate->current_phase = 1;
- initialize_phase(aggstate, 1);
- select_current_set(aggstate, 0, false);
- }
+ aggstate->current_phase = 0;
+ initialize_phase(aggstate, 0);
+ select_current_set(aggstate, 0, aggstate->aggstrategy == AGG_HASHED);
/* -----------------
* Perform lookups of aggregate function info, and initialize the
@@ -3941,12 +3990,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
- /* Initialize hash contexts and hash tables for hash aggregates */
+ /* Initialize hash contexts and hash tables for hash aggregates */
if (use_hashing)
{
Plan *outerplan = outerPlan(node);
- uint64 totalGroups = 0;
- int i;
aggstate->hash_metacxt = AllocSetContextCreate(
aggstate->ss.ps.state->es_query_cxt,
@@ -3964,10 +4013,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* when there is more than one grouping set, but should still be
* reasonable.
*/
- for (i = 0; i < aggstate->num_hashes; i++)
- totalGroups += aggstate->perhash[i].aggnode->numGroups;
-
- hash_agg_set_limits(aggstate->hashentrysize, totalGroups, 0,
+ hash_agg_set_limits(aggstate->hashentrysize, totalHashGroups, 0,
&aggstate->hash_mem_limit,
&aggstate->hash_ngroups_limit,
&aggstate->hash_planned_partitions);
@@ -3986,51 +4032,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
*/
for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
{
- AggStatePerPhase phase = &aggstate->phases[phaseidx];
- bool dohash = false;
- bool dosort = false;
+ AggStatePerPhase phase = aggstate->phases[phaseidx];
- /* phase 0 doesn't necessarily exist */
- if (!phase->aggnode)
+ if (phase->skip_evaltrans)
continue;
- if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 1)
- {
- /*
- * Phase one, and only phase one, in a mixed agg performs both
- * sorting and aggregation.
- */
- dohash = true;
- dosort = true;
- }
- else if (aggstate->aggstrategy == AGG_MIXED && phaseidx == 0)
- {
- /*
- * No need to compute a transition function for an AGG_MIXED phase
- * 0 - the contents of the hashtables will have been computed
- * during phase 1.
- */
- continue;
- }
- else if (phase->aggstrategy == AGG_PLAIN ||
- phase->aggstrategy == AGG_SORTED)
- {
- dohash = false;
- dosort = true;
- }
- else if (phase->aggstrategy == AGG_HASHED)
- {
- dohash = true;
- dosort = false;
- }
- else
- Assert(false);
-
- phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
- false);
+ phase->evaltrans = ExecBuildAggTrans(aggstate, phase, false, true);
/* cache compiled expression for outer slot without NULL check */
- phase->evaltrans_cache[0][0] = phase->evaltrans;
+ phase->evaltrans_cache[HASHAGG_INITIAL] = phase->evaltrans;
}
return aggstate;
@@ -4516,13 +4526,21 @@ ExecEndAgg(AggState *node)
int transno;
int numGroupingSets = Max(node->maxsets, 1);
int setno;
+ int phaseidx;
/* Make sure we have closed any open tuplesorts */
+ for (phaseidx = 0; phaseidx < node->numphases; phaseidx++)
+ {
+ AggStatePerPhase phase = node->phases[phaseidx];
+ AggStatePerPhaseSort persort;
- if (node->sort_in)
- tuplesort_end(node->sort_in);
- if (node->sort_out)
- tuplesort_end(node->sort_out);
+ if (phase->is_hashed)
+ continue;
+
+ persort = (AggStatePerPhaseSort) phase;
+ if (persort->sort_in)
+ tuplesort_end(persort->sort_in);
+ }
hashagg_reset_spill_state(node);
@@ -4572,6 +4590,7 @@ ExecReScanAgg(AggState *node)
int transno;
int numGroupingSets = Max(node->maxsets, 1);
int setno;
+ int phaseidx;
node->agg_done = false;
@@ -4596,8 +4615,12 @@ ExecReScanAgg(AggState *node)
if (outerPlan->chgParam == NULL && !node->hash_ever_spilled &&
!bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
{
- ResetTupleHashIterator(node->perhash[0].hashtable,
- &node->perhash[0].hashiter);
+ AggStatePerPhaseHash perhash = (AggStatePerPhaseHash) node->phases[0];
+ ResetTupleHashIterator(perhash->hashtable,
+ &perhash->hashiter);
+
+ /* reset to phase 0 */
+ initialize_phase(node, 0);
select_current_set(node, 0, true);
return;
}
@@ -4662,7 +4685,8 @@ ExecReScanAgg(AggState *node)
node->table_filled = false;
/* iterator will be reset when the table is filled */
- hashagg_recompile_expressions(node, false, false);
+ node->hash_agg_stage = HASHAGG_INITIAL;
+ hashagg_recompile_expressions(node);
}
if (node->aggstrategy != AGG_HASHED)
@@ -4670,18 +4694,54 @@ ExecReScanAgg(AggState *node)
/*
* Reset the per-group state (in particular, mark transvalues null)
*/
- for (setno = 0; setno < numGroupingSets; setno++)
+ for (phaseidx = 0; phaseidx < node->numphases; phaseidx++)
{
- MemSet(node->pergroups[setno], 0,
- sizeof(AggStatePerGroupData) * node->numaggs);
+ AggStatePerPhase phase = node->phases[phaseidx];
+
+ /* hash pergroups is reset by build_hash_tables */
+ if (phase->is_hashed)
+ continue;
+
+ for (setno = 0; setno < phase->numsets; setno++)
+ MemSet(phase->pergroups[setno], 0,
+ sizeof(AggStatePerGroupData) * node->numaggs);
}
- /* Reset input_sorted */
+ /*
+ * If the agg did its own first sort using a tuplesort and that
+ * tuplesort was kept (see initialize_phase), and the subplan has
+ * no parameter changes, and none of our own parameter changes
+ * affect input expressions of the aggregated functions, then we
+ * can simply rescan the first tuplesort instead of building it
+ * again.
+ *
+ * Note: the agg does its own sort only for grouping sets now.
+ */
if (aggnode->sortnode)
- node->input_sorted = false;
+ {
+ AggStatePerPhaseSort firstphase = (AggStatePerPhaseSort) node->phases[0];
+ bool randomAccess = (node->eflags & (EXEC_FLAG_REWIND |
+ EXEC_FLAG_BACKWARD |
+ EXEC_FLAG_MARK)) != 0;
+ if (firstphase->sort_in &&
+ randomAccess &&
+ outerPlan->chgParam == NULL &&
+ !bms_overlap(node->ss.ps.chgParam, aggnode->aggParams))
+ {
+ tuplesort_rescan(firstphase->sort_in);
+ node->input_sorted = true;
+ }
+ else
+ {
+ if (firstphase->sort_in)
+ tuplesort_end(firstphase->sort_in);
+ firstphase->sort_in = NULL;
+ node->input_sorted = false;
+ }
+ }
- /* reset to phase 1 */
- initialize_phase(node, 1);
+ /* reset to phase 0 */
+ initialize_phase(node, 0);
node->input_done = false;
node->projected_set = -1;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index b855e73..066cd59 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -2049,30 +2049,26 @@ llvm_compile_expr(ExprState *state)
case EEOP_AGG_PLAIN_PERGROUP_NULLCHECK:
{
int jumpnull;
- LLVMValueRef v_aggstatep;
- LLVMValueRef v_allpergroupsp;
+ LLVMValueRef v_pergroupsp;
LLVMValueRef v_pergroup_allaggs;
- LLVMValueRef v_setoff;
+ LLVMValueRef v_setno;
jumpnull = op->d.agg_plain_pergroup_nullcheck.jumpnull;
/*
- * pergroup_allaggs = aggstate->all_pergroups
- * [op->d.agg_plain_pergroup_nullcheck.setoff];
+ * pergroup =
+ * &op->d.agg_plain_pergroup_nullcheck.pergroups
+ * [op->d.agg_plain_pergroup_nullcheck.setno];
*/
- v_aggstatep = LLVMBuildBitCast(
- b, v_parent, l_ptr(StructAggState), "");
+ v_pergroupsp =
+ l_ptr_const(op->d.agg_plain_pergroup_nullcheck.pergroups,
+ l_ptr(l_ptr(StructAggStatePerGroupData)));
- v_allpergroupsp = l_load_struct_gep(
- b, v_aggstatep,
- FIELDNO_AGGSTATE_ALL_PERGROUPS,
- "aggstate.all_pergroups");
+ v_setno =
+ l_int32_const(op->d.agg_plain_pergroup_nullcheck.setno);
- v_setoff = l_int32_const(
- op->d.agg_plain_pergroup_nullcheck.setoff);
-
- v_pergroup_allaggs = l_load_gep1(
- b, v_allpergroupsp, v_setoff, "");
+ v_pergroup_allaggs =
+ l_load_gep1(b, v_pergroupsp, v_setno, "");
LLVMBuildCondBr(
b,
@@ -2094,6 +2090,7 @@ llvm_compile_expr(ExprState *state)
{
AggState *aggstate;
AggStatePerTrans pertrans;
+ AggStatePerGroup *pergroups;
FunctionCallInfo fcinfo;
LLVMValueRef v_aggstatep;
@@ -2103,12 +2100,12 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_transvaluep;
LLVMValueRef v_transnullp;
- LLVMValueRef v_setoff;
+ LLVMValueRef v_setno;
LLVMValueRef v_transno;
LLVMValueRef v_aggcontext;
- LLVMValueRef v_allpergroupsp;
+ LLVMValueRef v_pergroupsp;
LLVMValueRef v_current_setp;
LLVMValueRef v_current_pertransp;
LLVMValueRef v_curaggcontext;
@@ -2124,6 +2121,7 @@ llvm_compile_expr(ExprState *state)
aggstate = castNode(AggState, state->parent);
pertrans = op->d.agg_trans.pertrans;
+ pergroups = op->d.agg_trans.pergroups;
fcinfo = pertrans->transfn_fcinfo;
@@ -2133,19 +2131,18 @@ llvm_compile_expr(ExprState *state)
l_ptr(StructAggStatePerTransData));
/*
- * pergroup = &aggstate->all_pergroups
- * [op->d.agg_strict_trans_check.setoff]
- * [op->d.agg_init_trans_check.transno];
+ * pergroup = &op->d.agg_trans.pergroups
+ * [op->d.agg_trans.setno]
+ * [op->d.agg_trans.transno];
*/
- v_allpergroupsp =
- l_load_struct_gep(b, v_aggstatep,
- FIELDNO_AGGSTATE_ALL_PERGROUPS,
- "aggstate.all_pergroups");
- v_setoff = l_int32_const(op->d.agg_trans.setoff);
+ v_pergroupsp =
+ l_ptr_const(pergroups,
+ l_ptr(l_ptr(StructAggStatePerGroupData)));
+ v_setno = l_int32_const(op->d.agg_trans.setno);
v_transno = l_int32_const(op->d.agg_trans.transno);
v_pergroupp =
LLVMBuildGEP(b,
- l_load_gep1(b, v_allpergroupsp, v_setoff, ""),
+ l_load_gep1(b, v_pergroupsp, v_setno, ""),
&v_transno, 1, "");
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 7c29f89..e9ad5a9 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -2226,8 +2226,6 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
chain = NIL;
if (list_length(rollups) > 1)
{
- bool is_first_sort = ((RollupData *) linitial(rollups))->is_hashed;
-
for_each_cell(lc, rollups, list_second_cell(rollups))
{
RollupData *rollup = lfirst(lc);
@@ -2245,24 +2243,17 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
*/
if (!rollup->is_hashed)
{
- if (!is_first_sort ||
- (is_first_sort && !best_path->is_sorted))
- {
- sort_plan = (Plan *)
- make_sort_from_groupcols(rollup->groupClause,
- new_grpColIdx,
- subplan);
-
- /*
- * Remove stuff we don't need to avoid bloating debug output.
- */
- sort_plan->targetlist = NIL;
- sort_plan->lefttree = NULL;
- }
- }
+ sort_plan = (Plan *)
+ make_sort_from_groupcols(rollup->groupClause,
+ new_grpColIdx,
+ subplan);
- if (!rollup->is_hashed)
- is_first_sort = false;
+ /*
+ * Remove stuff we don't need to avoid bloating debug output.
+ */
+ sort_plan->targetlist = NIL;
+ sort_plan->lefttree = NULL;
+ }
if (rollup->is_hashed)
strat = AGG_HASHED;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 0cab951..2b2391b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4354,7 +4354,8 @@ consider_groupingsets_paths(PlannerInfo *root,
if (unhashed_rollup)
{
- new_rollups = lappend(new_rollups, unhashed_rollup);
+ /* unhashed rollups always sit before hashed rollups */
+ new_rollups = lcons(unhashed_rollup, new_rollups);
strat = AGG_MIXED;
}
else if (empty_sets)
@@ -4367,7 +4368,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = list_length(empty_sets);
rollup->hashable = false;
rollup->is_hashed = false;
- new_rollups = lappend(new_rollups, rollup);
+ /* unhashed rollups always sit before hashed rollups */
+ new_rollups = lcons(rollup, new_rollups);
/*
* The first non-hashed rollup is PLAIN AGG, is_sorted
* should be true.
@@ -4536,7 +4538,8 @@ consider_groupingsets_paths(PlannerInfo *root,
rollup->numGroups = gs->numGroups;
rollup->hashable = true;
rollup->is_hashed = true;
- rollups = lcons(rollup, rollups);
+ /* non-hashed rollups always sit before hashed rollups */
+ rollups = lappend(rollups, rollup);
}
if (rollups)
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 6e88992..4578c31 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -3001,7 +3001,6 @@ create_groupingsets_path(PlannerInfo *root,
PathTarget *target = rel->reltarget;
ListCell *lc;
bool is_first = true;
- bool is_first_sort = true;
/* The topmost generated Plan node will be an Agg */
pathnode->path.pathtype = T_Agg;
@@ -3054,14 +3053,13 @@ create_groupingsets_path(PlannerInfo *root,
int numGroupCols = list_length(linitial(gsets));
/*
- * In AGG_SORTED or AGG_PLAIN mode, the first rollup takes the
- * (already-sorted) input, and following ones do their own sort.
+ * In AGG_SORTED or AGG_PLAIN mode, the first rollup does its own
+ * sort if is_sorted is false; the following ones do their own sorts.
*
* In AGG_HASHED mode, there is one rollup for each grouping set.
*
- * In AGG_MIXED mode, the first rollups are hashed, the first
- * non-hashed one takes the (already-sorted) input, and following ones
- * do their own sort.
+ * In AGG_MIXED mode, the first rollup does its own sort if
+ * is_sorted is false; the following non-hashed ones do their own sorts.
*/
if (is_first)
{
@@ -3095,33 +3093,21 @@ create_groupingsets_path(PlannerInfo *root,
subpath->rows,
subpath->pathtarget->width);
is_first = false;
- if (!rollup->is_hashed)
- is_first_sort = false;
}
else
{
+ AggStrategy rollup_strategy;
Path sort_path; /* dummy for result of cost_sort */
Path agg_path; /* dummy for result of cost_agg */
- if (rollup->is_hashed || (is_first_sort && is_sorted))
- {
- /*
- * Account for cost of aggregation, but don't charge input
- * cost again
- */
- cost_agg(&agg_path, root,
- rollup->is_hashed ? AGG_HASHED : AGG_SORTED,
- agg_costs,
- numGroupCols,
- rollup->numGroups,
- having_qual,
- 0.0, 0.0,
- subpath->rows,
- subpath->pathtarget->width);
- if (!rollup->is_hashed)
- is_first_sort = false;
- }
- else
+ sort_path.startup_cost = 0;
+ sort_path.total_cost = 0;
+ sort_path.rows = subpath->rows;
+
+ rollup_strategy = rollup->is_hashed ?
+ AGG_HASHED : (numGroupCols ? AGG_SORTED : AGG_PLAIN);
+
+ if (!rollup->is_hashed && numGroupCols)
{
/* Account for cost of sort, but don't charge input cost again */
cost_sort(&sort_path, root, NIL,
@@ -3131,21 +3117,20 @@ create_groupingsets_path(PlannerInfo *root,
0.0,
work_mem,
-1.0);
-
- /* Account for cost of aggregation */
-
- cost_agg(&agg_path, root,
- AGG_SORTED,
- agg_costs,
- numGroupCols,
- rollup->numGroups,
- having_qual,
- sort_path.startup_cost,
- sort_path.total_cost,
- sort_path.rows,
- subpath->pathtarget->width);
}
+ /* Account for cost of aggregation */
+ cost_agg(&agg_path, root,
+ rollup_strategy,
+ agg_costs,
+ numGroupCols,
+ rollup->numGroups,
+ having_qual,
+ sort_path.startup_cost,
+ sort_path.total_cost,
+ sort_path.rows,
+ subpath->pathtarget->width);
+
pathnode->path.total_cost += agg_path.total_cost;
pathnode->path.rows += agg_path.rows;
}
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index dbe8649..4ed5d0a 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -626,7 +626,8 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_PLAIN_PERGROUP_NULLCHECK */
struct
{
- int setoff;
+ AggStatePerGroup *pergroups;
+ int setno;
int jumpnull;
} agg_plain_pergroup_nullcheck;
@@ -634,11 +635,11 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
struct
{
+ AggStatePerGroup *pergroups;
AggStatePerTrans pertrans;
ExprContext *aggcontext;
int setno;
int transno;
- int setoff;
} agg_trans;
} d;
} ExprEvalStep;
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index cd0e643..2dda60b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -255,7 +255,7 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash, bool nullcheck);
+ bool nullcheck, bool allow_concurrent_hashing);
extern ExprState *ExecBuildGroupingEqual(TupleDesc ldesc, TupleDesc rdesc,
const TupleTableSlotOps *lops, const TupleTableSlotOps *rops,
int numCols,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 9e70bd8..1612b71 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -270,21 +270,33 @@ typedef struct AggStatePerGroupData
*/
typedef struct AggStatePerPhaseData
{
+ int phaseidx; /* index of this phase in aggstate->phases */
+ bool is_hashed; /* true if this phase does hash aggregation */
AggStrategy aggstrategy; /* strategy for this phase */
- int numsets; /* number of grouping sets (or 0) */
+ int numsets; /* number of grouping sets */
int *gset_lengths; /* lengths of grouping sets */
Bitmapset **grouped_cols; /* column groupings for rollup */
- ExprState **eqfunctions; /* expression returning equality, indexed by
- * nr of cols to compare */
Agg *aggnode; /* Agg node for phase data */
ExprState *evaltrans; /* evaluation of transition functions */
-
/* cached variants of the compiled expression */
- ExprState *evaltrans_cache
- [2] /* 0: outerops; 1: TTSOpsMinimalTuple */
- [2]; /* 0: no NULL check; 1: with NULL check */
+ ExprState *evaltrans_cache[3];
+
+ List *concurrent_hashes; /* hash phases can do transition concurrently */
+ AggStatePerGroup *pergroups; /* pergroup states for a phase */
+
+ bool skip_evaltrans; /* do not build evaltrans */
} AggStatePerPhaseData;
+typedef struct AggStatePerPhaseSortData
+{
+ AggStatePerPhaseData phasedata;
+ Tuplesortstate *sort_in; /* sorted input to this phase */
+ Tuplestorestate *store_in; /* stored (already sorted) input to this phase */
+ ExprState **eqfunctions; /* expression returning equality, indexed by
+ * nr of cols to compare */
+ bool copy_out; /* hint to copy input tuples out for the next phase */
+} AggStatePerPhaseSortData;
+
/*
* AggStatePerHashData - per-hashtable state
*
@@ -292,8 +304,9 @@ typedef struct AggStatePerPhaseData
* grouping set. (When doing hashing without grouping sets, we have just one of
* them.)
*/
-typedef struct AggStatePerHashData
+typedef struct AggStatePerPhaseHashData
{
+ AggStatePerPhaseData phasedata;
TupleHashTable hashtable; /* hash table with one entry per group */
TupleHashIterator hashiter; /* for iterating through hash table */
TupleTableSlot *hashslot; /* slot for loading hash table */
@@ -304,9 +317,8 @@ typedef struct AggStatePerHashData
int largestGrpColIdx; /* largest col required for hashing */
AttrNumber *hashGrpColIdxInput; /* hash col indices in input slot */
AttrNumber *hashGrpColIdxHash; /* indices in hash table tuples */
- Agg *aggnode; /* original Agg node, for numGroups etc. */
-} AggStatePerHashData;
-
+ struct HashAggSpill *hash_spill; /* HashAggSpill for current hash grouping set */
+} AggStatePerPhaseHashData;
extern AggState *ExecInitAgg(Agg *node, EState *estate, int eflags);
extern void ExecEndAgg(AggState *node);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 75a45b2..688f0c7 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2036,7 +2036,8 @@ typedef struct AggStatePerAggData *AggStatePerAgg;
typedef struct AggStatePerTransData *AggStatePerTrans;
typedef struct AggStatePerGroupData *AggStatePerGroup;
typedef struct AggStatePerPhaseData *AggStatePerPhase;
-typedef struct AggStatePerHashData *AggStatePerHash;
+typedef struct AggStatePerPhaseSortData *AggStatePerPhaseSort;
+typedef struct AggStatePerPhaseHashData *AggStatePerPhaseHash;
typedef struct AggState
{
@@ -2068,21 +2069,17 @@ typedef struct AggState
List *all_grouped_cols; /* list of all grouped cols in DESC order */
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
- AggStatePerPhase phases; /* array of all phases */
+ AggStatePerPhase *phases; /* array of all phases */
Tuplesortstate *sort_in; /* sorted input to phases > 1 */
Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
- AggStatePerGroup *pergroups; /* grouping set indexed array of per-group
- * pointers */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
- /* these fields are used in AGG_HASHED and AGG_MIXED modes: */
+ /* these fields are used in AGG_HASHED */
bool table_filled; /* hash table filled yet? */
int num_hashes;
MemoryContext hash_metacxt; /* memory for hash table itself */
struct HashTapeInfo *hash_tapeinfo; /* metadata for spill tapes */
- struct HashAggSpill *hash_spills; /* HashAggSpill for each grouping set,
- exists only during first pass */
TupleTableSlot *hash_spill_slot; /* slot for reading from spill files */
List *hash_batches; /* hash batches remaining to be processed */
bool hash_ever_spilled; /* ever spilled during this execution? */
@@ -2098,18 +2095,13 @@ typedef struct AggState
memory in all hash tables */
uint64 hash_disk_used; /* kB of disk space used */
int hash_batches_used; /* batches used during entire execution */
-
- AggStatePerHash perhash; /* array of per-hashtable data */
- AggStatePerGroup *hash_pergroup; /* grouping set indexed array of
- * per-group pointers */
+ int hash_agg_stage; /* hash aggregate stage, mainly for spill */
/* these fields are used in AGG_SORTED and AGG_MIXED */
bool input_sorted; /* hash table filled yet? */
+ int eflags; /* eflags for the first sort */
+
- /* support for evaluation of agg input expressions: */
-#define FIELDNO_AGGSTATE_ALL_PERGROUPS 50
- AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
- * ->hash_pergroup */
ProjectionInfo *combinedproj; /* projection machinery */
} AggState;
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index 1cb9700..b29917c 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1004,10 +1004,10 @@ explain (costs off) select a, b, grouping(a,b), sum(v), count(*), max(v)
Sort
Sort Key: (GROUPING("*VALUES*".column1, "*VALUES*".column2)), "*VALUES*".column1, "*VALUES*".column2
-> MixedAggregate
+ Group Key: ()
Hash Key: "*VALUES*".column1, "*VALUES*".column2
Hash Key: "*VALUES*".column1
Hash Key: "*VALUES*".column2
- Group Key: ()
-> Values Scan on "*VALUES*"
(8 rows)
@@ -1066,9 +1066,9 @@ explain (costs off)
Sort
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
- Hash Key: unsortable_col
Sort Key: unhashable_col
Group Key: unhashable_col
+ Hash Key: unsortable_col
-> Seq Scan on gstest4
(7 rows)
@@ -1108,9 +1108,9 @@ explain (costs off)
Sort
Sort Key: (GROUPING(unhashable_col, unsortable_col)), (sum(v))
-> MixedAggregate
- Hash Key: v, unsortable_col
Sort Key: v, unhashable_col
Group Key: v, unhashable_col
+ Hash Key: v, unsortable_col
-> Seq Scan on gstest4
(7 rows)
@@ -1149,10 +1149,10 @@ explain (costs off)
QUERY PLAN
--------------------------------
MixedAggregate
- Hash Key: a, b
Group Key: ()
Group Key: ()
Group Key: ()
+ Hash Key: a, b
-> Seq Scan on gstest_empty
(6 rows)
@@ -1310,10 +1310,10 @@ explain (costs off)
-> Sort
Sort Key: a, b
-> MixedAggregate
+ Group Key: ()
Hash Key: a, b
Hash Key: a
Hash Key: b
- Group Key: ()
-> Seq Scan on gstest2
(11 rows)
@@ -1345,10 +1345,10 @@ explain (costs off)
Sort
Sort Key: gstest_data.a, gstest_data.b
-> MixedAggregate
+ Group Key: ()
Hash Key: gstest_data.a, gstest_data.b
Hash Key: gstest_data.a
Hash Key: gstest_data.b
- Group Key: ()
-> Nested Loop
-> Values Scan on "*VALUES*"
-> Function Scan on gstest_data
@@ -1545,16 +1545,16 @@ explain (costs off)
QUERY PLAN
----------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
Sort Key: unique1
Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
Sort Key: thousand
Group Key: thousand
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(12 rows)
@@ -1567,12 +1567,12 @@ explain (costs off)
QUERY PLAN
-------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
Sort Key: unique1
Group Key: unique1
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(8 rows)
@@ -1586,15 +1586,15 @@ explain (costs off)
QUERY PLAN
----------------------------
MixedAggregate
- Hash Key: two
- Hash Key: four
- Hash Key: ten
- Hash Key: hundred
- Hash Key: thousand
Sort Key: unique1
Group Key: unique1
Sort Key: twothousand
Group Key: twothousand
+ Hash Key: thousand
+ Hash Key: hundred
+ Hash Key: ten
+ Hash Key: four
+ Hash Key: two
-> Seq Scan on tenk1
(11 rows)
@@ -1671,6 +1671,7 @@ group by cube (g1000, g100,g10);
QUERY PLAN
---------------------------------------------------
MixedAggregate
+ Group Key: ()
Hash Key: (g.g % 1000), (g.g % 100), (g.g % 10)
Hash Key: (g.g % 1000), (g.g % 100)
Hash Key: (g.g % 1000)
@@ -1678,7 +1679,6 @@ group by cube (g1000, g100,g10);
Hash Key: (g.g % 100)
Hash Key: (g.g % 10), (g.g % 1000)
Hash Key: (g.g % 10)
- Group Key: ()
-> Function Scan on generate_series g
(10 rows)
diff --git a/src/test/regress/expected/partition_aggregate.out b/src/test/regress/expected/partition_aggregate.out
index fbc8d3a..7818f02 100644
--- a/src/test/regress/expected/partition_aggregate.out
+++ b/src/test/regress/expected/partition_aggregate.out
@@ -340,8 +340,8 @@ SELECT c, sum(a) FROM pagg_tab GROUP BY rollup(c) ORDER BY 1, 2;
Sort
Sort Key: pagg_tab.c, (sum(pagg_tab.a))
-> MixedAggregate
- Hash Key: pagg_tab.c
Group Key: ()
+ Hash Key: pagg_tab.c
-> Append
-> Seq Scan on pagg_tab_p1 pagg_tab_1
-> Seq Scan on pagg_tab_p2 pagg_tab_2
--
1.8.3.1
Attachment: 0005-Parallel-grouping-sets.patch (application/octet-stream)
From 4cfb7ed009b1123fdf5c6479c8dc33ea5b435542 Mon Sep 17 00:00:00 2001
From: Pengzhou Tang <ptang@pivotal.io>
Date: Wed, 11 Mar 2020 23:08:11 -0400
Subject: [PATCH 5/5] Parallel grouping sets
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
We used to support grouping sets in one worker only; this patch adds
support for parallel grouping sets using multiple workers.

The main idea of parallel grouping sets is: like parallel aggregate, we
separate grouping sets into two stages:

The initial stage: this stage has almost the same plan and execution
routines as the current implementation of grouping sets; the differences
are 1) it only produces partial aggregate results, and 2) the output is
decorated with an extra grouping set ID. The partial aggregate results
will be combined in the final stage, and since we have multiple grouping
sets, only partial results belonging to the same grouping set can be
combined; that is why the grouping set ID is introduced to identify the
sets. We keep all the optimizations for multiple grouping sets in the
initial stage, e.g., 1) grouping sets that can be grouped by a single
sort are put into one rollup structure so those sets are computed in one
aggregate phase, 2) hash aggregation is done concurrently while a sort
aggregate is performed, and 3) all hash transitions are done in one
expression state.

The final stage: this stage combines the partial aggregate results
according to the grouping set ID. Obviously, the optimizations of the
initial stage cannot be used here, so all rollups are extracted and each
rollup contains only one grouping set; each aggregate phase then
processes only one set. A filter in the final stage redirects tuples to
the appropriate aggregate phase.
---
src/backend/commands/explain.c | 10 +-
src/backend/executor/execExpr.c | 10 +-
src/backend/executor/execExprInterp.c | 11 +
src/backend/executor/nodeAgg.c | 272 +++++++++++++++++++++--
src/backend/jit/llvm/llvmjit_expr.c | 40 ++++
src/backend/nodes/copyfuncs.c | 56 ++++-
src/backend/nodes/equalfuncs.c | 3 +
src/backend/nodes/nodeFuncs.c | 8 +
src/backend/nodes/outfuncs.c | 14 +-
src/backend/nodes/readfuncs.c | 53 ++++-
src/backend/optimizer/path/allpaths.c | 5 +-
src/backend/optimizer/plan/createplan.c | 25 +--
src/backend/optimizer/plan/planner.c | 334 ++++++++++++++++++++++-------
src/backend/optimizer/plan/setrefs.c | 16 ++
src/backend/optimizer/util/pathnode.c | 27 ++-
src/backend/utils/adt/ruleutils.c | 6 +
src/include/executor/execExpr.h | 1 +
src/include/executor/nodeAgg.h | 2 +
src/include/nodes/execnodes.h | 8 +-
src/include/nodes/nodes.h | 1 +
src/include/nodes/pathnodes.h | 2 +
src/include/nodes/plannodes.h | 4 +-
src/include/nodes/primnodes.h | 6 +
src/include/optimizer/pathnode.h | 1 +
src/include/optimizer/planmain.h | 2 +-
src/test/regress/expected/groupingsets.out | 112 ++++++++++
src/test/regress/sql/groupingsets.sql | 64 ++++++
27 files changed, 968 insertions(+), 125 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7486d4b..fead66f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2258,12 +2258,16 @@ show_agg_keys(AggState *astate, List *ancestors,
{
Agg *plan = (Agg *) astate->ss.ps.plan;
- if (plan->numCols > 0 || plan->groupingSets)
+ if (plan->gsetid)
+ show_expression((Node *) plan->gsetid, "Filtered by",
+ (PlanState *) astate, ancestors, true, es);
+
+ if (plan->numCols > 0 || plan->rollup)
{
/* The key columns refer to the tlist of the child plan */
ancestors = lcons(plan, ancestors);
- if (plan->groupingSets)
+ if (plan->rollup)
show_grouping_sets(outerPlanState(astate), plan, ancestors, es);
else
show_sort_group_keys(outerPlanState(astate), "Group Key",
@@ -2314,7 +2318,7 @@ show_grouping_set_keys(PlanState *planstate,
Plan *plan = planstate->plan;
char *exprstr;
ListCell *lc;
- List *gsets = aggnode->groupingSets;
+ List *gsets = aggnode->rollup->gsets;
AttrNumber *keycols = aggnode->grpColIdx;
const char *keyname;
const char *keysetname;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 3533f5c..4ed455d 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -815,7 +815,7 @@ ExecInitExprRec(Expr *node, ExprState *state,
agg = (Agg *) (state->parent->plan);
- if (agg->groupingSets)
+ if (agg->rollup)
scratch.d.grouping_func.clauses = grp_node->cols;
else
scratch.d.grouping_func.clauses = NIL;
@@ -824,6 +824,14 @@ ExecInitExprRec(Expr *node, ExprState *state,
break;
}
+ case T_GroupingSetId:
+ {
+ scratch.opcode = EEOP_GROUPING_SET_ID;
+
+ ExprEvalPushStep(state, &scratch);
+ break;
+ }
+
case T_WindowFunc:
{
WindowFunc *wfunc = (WindowFunc *) node;
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index b0dbba4..b3537eb 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -428,6 +428,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_XMLEXPR,
&&CASE_EEOP_AGGREF,
&&CASE_EEOP_GROUPING_FUNC,
+ &&CASE_EEOP_GROUPING_SET_ID,
&&CASE_EEOP_WINDOW_FUNC,
&&CASE_EEOP_SUBPLAN,
&&CASE_EEOP_ALTERNATIVE_SUBPLAN,
@@ -1512,6 +1513,16 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_GROUPING_SET_ID)
+ {
+ AggState *aggstate = castNode(AggState, state->parent);
+
+ *op->resvalue = aggstate->phase->setno_gsetids[aggstate->current_set];
+ *op->resnull = false;
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_WINDOW_FUNC)
{
/*
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3287ed4..e105e78 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -438,6 +438,7 @@ static TupleTableSlot *agg_retrieve_direct(AggState *aggstate);
static void agg_fill_hash_table(AggState *aggstate);
static bool agg_refill_hash_table(AggState *aggstate);
static void agg_sort_input(AggState *aggstate);
+static void agg_preprocess_groupingsets(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table(AggState *aggstate);
static TupleTableSlot *agg_retrieve_hash_table_in_memory(AggState *aggstate);
static void hash_agg_check_limits(AggState *aggstate);
@@ -517,17 +518,26 @@ initialize_phase(AggState *aggstate, int newphase)
* Whatever the previous state, we're now done with whatever input
* tuplesort was in use, cleanup them.
*
- * Note: we keep the first tuplesort/tuplestore, this will benifit the
+ * Note: we keep the first tuplesort/tuplestore when this is not the
+ * final stage of partial grouping sets; this benefits a rescan in
+ * some cases by avoiding re-sorting the input.
*/
- if (!current_phase->is_hashed && aggstate->current_phase > 0)
+ if (!current_phase->is_hashed &&
+ (aggstate->current_phase > 0 || DO_AGGSPLIT_COMBINE(aggstate->aggsplit)))
{
persort = (AggStatePerPhaseSort) current_phase;
+
if (persort->sort_in)
{
tuplesort_end(persort->sort_in);
persort->sort_in = NULL;
}
+
+ if (persort->store_in)
+ {
+ tuplestore_end(persort->store_in);
+ persort->store_in = NULL;
+ }
}
/* advance to next phase */
@@ -596,6 +606,15 @@ fetch_input_tuple(AggState *aggstate)
return NULL;
slot = aggstate->sort_slot;
}
+ else if (current_phase->store_in)
+ {
+ /* make sure we check for interrupts in either path through here */
+ CHECK_FOR_INTERRUPTS();
+ if (!tuplestore_gettupleslot(current_phase->store_in, true, false,
+ aggstate->sort_slot))
+ return NULL;
+ slot = aggstate->sort_slot;
+ }
else
slot = ExecProcNode(outerPlanState(aggstate));
@@ -2172,6 +2191,9 @@ ExecAgg(PlanState *pstate)
CHECK_FOR_INTERRUPTS();
+ if (node->groupingsets_preprocess)
+ agg_preprocess_groupingsets(node);
+
if (!node->agg_done)
{
/* Dispatch based on strategy */
@@ -2212,7 +2234,7 @@ agg_retrieve_direct(AggState *aggstate)
TupleTableSlot *outerslot;
TupleTableSlot *firstSlot;
TupleTableSlot *result;
- bool hasGroupingSets = aggstate->phase->aggnode->groupingSets != NULL;
+ bool hasGroupingSets = aggstate->phase->aggnode->rollup != NULL;
int numGroupingSets = aggstate->phase->numsets;
int currentSet;
int nextSetSize;
@@ -2549,6 +2571,144 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+/*
+ * Routine for the final phase of partial grouping sets.
+ *
+ * Preprocess tuples for the final phase of grouping sets. In the initial
+ * phase, each tuple is decorated with a grouping set ID; in the final
+ * phase, each grouping set is handled by its own aggregate phase, so we
+ * must redirect each tuple to the proper aggregate phase according to
+ * its grouping set ID.
+ */
+static void
+agg_preprocess_groupingsets(AggState *aggstate)
+{
+ AggStatePerPhaseSort persort;
+ AggStatePerPhaseHash perhash;
+ AggStatePerPhase phase;
+ TupleTableSlot *outerslot;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ int phaseidx;
+
+ Assert(DO_AGGSPLIT_COMBINE(aggstate->aggsplit));
+ Assert(aggstate->groupingsets_preprocess);
+
+ /* Initialize tuple storage for each aggregate phase */
+ for (phaseidx = 0; phaseidx < aggstate->numphases; phaseidx++)
+ {
+ phase = aggstate->phases[phaseidx];
+
+ if (!phase->is_hashed)
+ {
+ persort = (AggStatePerPhaseSort) phase;
+ if (phase->aggnode->sortnode)
+ {
+ Sort *sortnode = (Sort *) phase->aggnode->sortnode;
+ PlanState *outerNode = outerPlanState(aggstate);
+ TupleDesc tupDesc = ExecGetResultType(outerNode);
+
+ persort->sort_in = tuplesort_begin_heap(tupDesc,
+ sortnode->numCols,
+ sortnode->sortColIdx,
+ sortnode->sortOperators,
+ sortnode->collations,
+ sortnode->nullsFirst,
+ work_mem,
+ NULL, false);
+ }
+ else
+ {
+ persort->store_in = tuplestore_begin_heap(false, false, work_mem);
+ }
+ }
+ else
+ {
+ /*
+ * If the phase is AGG_HASHED, we don't need storage to keep
+ * the tuples for later processing; we can do the transition
+ * immediately.
+ */
+ }
+ }
+
+ for (;;)
+ {
+ Datum ret;
+ bool isNull;
+ int setid;
+
+ outerslot = ExecProcNode(outerPlanState(aggstate));
+ if (TupIsNull(outerslot))
+ break;
+
+ tmpcontext->ecxt_outertuple = outerslot;
+
+ /* Figure out which grouping set the tuple belongs to */
+ ret = ExecEvalExprSwitchContext(aggstate->gsetid, tmpcontext, &isNull);
+
+ setid = DatumGetInt32(ret);
+ phase = aggstate->phases[aggstate->gsetid_phaseidxs[setid]];
+
+ if (!phase->is_hashed)
+ {
+ persort = (AggStatePerPhaseSort) phase;
+
+ Assert(persort->sort_in || persort->store_in);
+
+ if (persort->sort_in)
+ tuplesort_puttupleslot(persort->sort_in, outerslot);
+ else if (persort->store_in)
+ tuplestore_puttupleslot(persort->store_in, outerslot);
+ }
+ else
+ {
+ perhash = (AggStatePerPhaseHash) phase;
+
+ /* If it is hashed, we can do the transition now. */
+ aggstate->current_phase = phase->phaseidx;
+ aggstate->phase = phase;
+ select_current_set(aggstate, 0, true);
+ hashagg_recompile_expressions(aggstate);
+
+ lookup_hash_entries(aggstate, perhash, NIL);
+ /* Do the transition */
+ advance_aggregates(aggstate);
+
+ /* Change current phase back to phase 0 */
+ aggstate->current_phase = 0;
+ aggstate->phase = aggstate->phases[0];
+ }
+
+ ResetExprContext(aggstate->tmpcontext);
+ }
+
+ /* Sort the first phase if needed */
+ if (aggstate->aggstrategy != AGG_HASHED)
+ {
+ persort = (AggStatePerPhaseSort) aggstate->phase;
+
+ if (persort->sort_in)
+ tuplesort_performsort(persort->sort_in);
+ }
+ else
+ {
+ /*
+ * If we built hash tables, finalize any spills now;
+ * AGG_MIXED will finalize its spills later.
+ */
+ hashagg_finish_initial_spills(aggstate);
+ }
+
+ /* Mark the hash tables as filled */
+ aggstate->table_filled = true;
+
+ /* Mark the input as sorted */
+ aggstate->input_sorted = true;
+
+ /* Clear the flag so grouping sets are not preprocessed again */
+ aggstate->groupingsets_preprocess = false;
+}
+
static void
agg_sort_input(AggState *aggstate)
{
@@ -3297,21 +3457,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->eflags = eflags;
aggstate->num_hashes = 0;
aggstate->hash_agg_stage = HASHAGG_INITIAL;
+ aggstate->groupingsets_preprocess = false;
/*
* Calculate the maximum number of grouping sets in any phase; this
* determines the size of some allocations.
*/
- if (node->groupingSets)
+ if (node->rollup)
{
- numGroupingSets = list_length(node->groupingSets);
+ numGroupingSets = list_length(node->rollup->gsets);
foreach(l, node->chain)
{
Agg *agg = lfirst(l);
numGroupingSets = Max(numGroupingSets,
- list_length(agg->groupingSets));
+ list_length(agg->rollup->gsets));
if (agg->aggstrategy != AGG_HASHED)
need_extra_slot = true;
@@ -3322,11 +3483,33 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
aggstate->numphases = 1 + list_length(node->chain);
/*
+ * We are doing the final stage of partial groupingsets; preprocess
+ * the input tuples first, redirecting each tuple to its corresponding
+ * aggregate phase. See agg_preprocess_groupingsets().
+ */
+ if (node->rollup && DO_AGGSPLIT_COMBINE(aggstate->aggsplit))
+ {
+ aggstate->groupingsets_preprocess = true;
+
+ /*
+ * Allocate the gsetid <-> phase mapping. In the final stage
+ * of partial groupingsets, every grouping set is extracted
+ * into an individual phase, so the number of sets is equal
+ * to the number of phases.
+ */
+ aggstate->gsetid_phaseidxs =
+ (int *) palloc0(aggstate->numphases * sizeof(int));
+
+ if (aggstate->aggstrategy != AGG_HASHED)
+ need_extra_slot = true;
+ }
+
+ /*
* The first phase is not sorted, agg need to do its own sort. See
* agg_sort_input(), this can only happen in groupingsets case.
*/
if (node->sortnode)
- aggstate->input_sorted = false;
+ aggstate->input_sorted = false;
aggstate->aggcontexts = (ExprContext **)
palloc0(sizeof(ExprContext *) * numGroupingSets);
@@ -3436,6 +3619,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
ExecInitQual(node->plan.qual, (PlanState *) aggstate);
/*
+ * Initialize the expression state used to fetch the grouping set ID
+ * from the partial groupingsets aggregate result.
+ */
+ aggstate->gsetid =
+ ExecInitExpr(node->gsetid, (PlanState *)aggstate);
+ /*
* We should now have found all Aggrefs in the targetlist and quals.
*/
numaggs = aggstate->numaggs;
@@ -3484,6 +3673,22 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
all_grouped_cols = bms_add_members(all_grouped_cols, cols);
/*
+ * In the initial stage of partial grouping sets, an extra
+ * grouping set ID is provided in the targetlist. Fill the
+ * setno <-> gsetid map so EEOP_GROUPING_SET_ID can evaluate
+ * the correct gsetid for the output.
+ */
+ if (aggnode->rollup &&
+ DO_AGGSPLIT_SERIALIZE(aggnode->aggsplit))
+ {
+ GroupingSetData *gs;
+ phasedata->setno_gsetids = palloc(sizeof(int));
+ gs = linitial_node(GroupingSetData,
+ aggnode->rollup->gsets_data);
+ phasedata->setno_gsetids[0] = gs->setId;
+ }
+
+ /*
* Initialize pergroup state. For AGG_HASHED, all groups do transition
* on the fly, all pergroup states are kept in hashtable, everytime
* a tuple is processed, lookup_hash_entry() choose one group and
@@ -3501,8 +3706,12 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* we can do the transition immediately when a tuple is fetched,
* which means we can do the transition concurrently with the
* first phase.
+ *
+ * Note: this does not work for the final phase of partial
+ * groupingsets, in which each partial input tuple has a specific
+ * target aggregate phase.
*/
- if (phaseidx > 0)
+ if (phaseidx > 0 && !aggstate->groupingsets_preprocess)
{
aggstate->phases[0]->concurrent_hashes =
lappend(aggstate->phases[0]->concurrent_hashes, perhash);
@@ -3520,17 +3729,19 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->aggnode = aggnode;
phasedata->aggstrategy = aggnode->aggstrategy;
- if (aggnode->groupingSets)
+ if (aggnode->rollup)
{
- phasedata->numsets = list_length(aggnode->groupingSets);
+ phasedata->numsets = list_length(aggnode->rollup->gsets_data);
phasedata->gset_lengths = palloc(phasedata->numsets * sizeof(int));
phasedata->grouped_cols = palloc(phasedata->numsets * sizeof(Bitmapset *));
+ phasedata->setno_gsetids = palloc(phasedata->numsets * sizeof(int));
i = 0;
- foreach(l, aggnode->groupingSets)
+ foreach(l, aggnode->rollup->gsets_data)
{
- int current_length = list_length(lfirst(l));
- Bitmapset *cols = NULL;
+ GroupingSetData *gs = lfirst_node(GroupingSetData, l);
+ int current_length = list_length(gs->set);
+ Bitmapset *cols = NULL;
/* planner forces this to be correct */
for (j = 0; j < current_length; ++j)
@@ -3539,6 +3750,15 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
phasedata->grouped_cols[i] = cols;
phasedata->gset_lengths[i] = current_length;
+ /*
+ * In the initial stage of partial grouping sets, an extra
+ * grouping set ID is provided in the targetlist. Fill the
+ * setno <-> gsetid map so EEOP_GROUPING_SET_ID can evaluate
+ * the correct gsetid for the output.
+ */
+ if (DO_AGGSPLIT_SERIALIZE(aggstate->aggsplit))
+ phasedata->setno_gsetids[i] = gs->setId;
+
++i;
}
@@ -3615,8 +3835,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
* For non-first AGG_SORTED phase, it processes the same input
* tuples with previous phase except that it need to resort the
* input tuples. Tell the previous phase to copy out the tuples.
+ *
+ * Note: it doesn't work for the final stage of partial grouping
+ * sets, in which each tuple has a specific target aggregate phase.
*/
- if (phaseidx > 0)
+ if (phaseidx > 0 && !aggstate->groupingsets_preprocess)
{
AggStatePerPhaseSort prev =
(AggStatePerPhaseSort) aggstate->phases[phaseidx - 1];
@@ -3627,6 +3850,18 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
}
}
+ /*
+ * Fill the gsetid_phaseidxs array so we can find the
+ * corresponding phase from a gsetid.
+ */
+ if (aggstate->groupingsets_preprocess)
+ {
+ GroupingSetData *gs =
+ linitial_node(GroupingSetData, aggnode->rollup->gsets_data);
+
+ aggstate->gsetid_phaseidxs[gs->setId] = phaseidx;
+ }
+
phasedata->phaseidx = phaseidx;
aggstate->phases[phaseidx] = phasedata;
}
@@ -4540,6 +4775,8 @@ ExecEndAgg(AggState *node)
persort = (AggStatePerPhaseSort) phase;
if (persort->sort_in)
tuplesort_end(persort->sort_in);
+ if (persort->store_in)
+ tuplestore_end(persort->store_in);
}
hashagg_reset_spill_state(node);
@@ -4740,6 +4977,13 @@ ExecReScanAgg(AggState *node)
}
}
+ /*
+ * If the agg is doing the final stage of partial groupingsets,
+ * reset the flag so grouping sets preprocessing runs again.
+ */
+ if (aggnode->rollup && DO_AGGSPLIT_COMBINE(node->aggsplit))
+ node->groupingsets_preprocess = true;
+
/* reset to phase 0 */
initialize_phase(node, 0);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 066cd59..f70eaab 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -1882,6 +1882,46 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_GROUPING_SET_ID:
+ {
+ LLVMValueRef v_resvalue;
+ LLVMValueRef v_aggstatep;
+ LLVMValueRef v_phase;
+ LLVMValueRef v_current_set;
+ LLVMValueRef v_setno_gsetids;
+
+ v_aggstatep =
+ LLVMBuildBitCast(b, v_parent, l_ptr(StructAggState), "");
+
+ /*
+ * op->resvalue =
+ * aggstate->phase->setno_gsetids
+ * [aggstate->current_set]
+ */
+ v_phase =
+ l_load_struct_gep(b, v_aggstatep,
+ FIELDNO_AGGSTATE_PHASE,
+ "aggstate.phase");
+ v_setno_gsetids =
+ l_load_struct_gep(b, v_phase,
+ FIELDNO_AGGSTATEPERPHASE_SETNOGSETIDS,
+ "aggstateperphase.setno_gsetids");
+ v_current_set =
+ l_load_struct_gep(b, v_aggstatep,
+ FIELDNO_AGGSTATE_CURRENT_SET,
+ "aggstate.current_set");
+ v_resvalue =
+ l_load_gep1(b, v_setno_gsetids, v_current_set, "");
+ v_resvalue =
+ LLVMBuildZExt(b, v_resvalue, TypeSizeT, "");
+
+ LLVMBuildStore(b, v_resvalue, v_resvaluep);
+ LLVMBuildStore(b, l_sbool_const(0), v_resnullp);
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
case EEOP_WINDOW_FUNC:
{
WindowFuncExprState *wfunc = op->d.window_func.wfstate;
diff --git a/src/backend/nodes/copyfuncs.c b/src/backend/nodes/copyfuncs.c
index 04b4c65..de4dcfe 100644
--- a/src/backend/nodes/copyfuncs.c
+++ b/src/backend/nodes/copyfuncs.c
@@ -990,8 +990,9 @@ _copyAgg(const Agg *from)
COPY_SCALAR_FIELD(numGroups);
COPY_SCALAR_FIELD(transitionSpace);
COPY_BITMAPSET_FIELD(aggParams);
- COPY_NODE_FIELD(groupingSets);
+ COPY_NODE_FIELD(rollup);
COPY_NODE_FIELD(chain);
+ COPY_NODE_FIELD(gsetid);
COPY_NODE_FIELD(sortnode);
return newnode;
@@ -1479,6 +1480,50 @@ _copyGroupingFunc(const GroupingFunc *from)
}
/*
+ * _copyGroupingSetId
+ */
+static GroupingSetId *
+_copyGroupingSetId(const GroupingSetId *from)
+{
+ GroupingSetId *newnode = makeNode(GroupingSetId);
+
+ return newnode;
+}
+
+/*
+ * _copyRollupData
+ */
+static RollupData *
+_copyRollupData(const RollupData *from)
+{
+ RollupData *newnode = makeNode(RollupData);
+
+ COPY_NODE_FIELD(groupClause);
+ COPY_NODE_FIELD(gsets);
+ COPY_NODE_FIELD(gsets_data);
+ COPY_SCALAR_FIELD(numGroups);
+ COPY_SCALAR_FIELD(hashable);
+ COPY_SCALAR_FIELD(is_hashed);
+
+ return newnode;
+}
+
+/*
+ * _copyGroupingSetData
+ */
+static GroupingSetData *
+_copyGroupingSetData(const GroupingSetData *from)
+{
+ GroupingSetData *newnode = makeNode(GroupingSetData);
+
+ COPY_NODE_FIELD(set);
+ COPY_SCALAR_FIELD(setId);
+ COPY_SCALAR_FIELD(numGroups);
+
+ return newnode;
+}
+
+/*
* _copyWindowFunc
*/
static WindowFunc *
@@ -4972,6 +5017,9 @@ copyObjectImpl(const void *from)
case T_GroupingFunc:
retval = _copyGroupingFunc(from);
break;
+ case T_GroupingSetId:
+ retval = _copyGroupingSetId(from);
+ break;
case T_WindowFunc:
retval = _copyWindowFunc(from);
break;
@@ -5608,6 +5656,12 @@ copyObjectImpl(const void *from)
case T_SortGroupClause:
retval = _copySortGroupClause(from);
break;
+ case T_RollupData:
+ retval = _copyRollupData(from);
+ break;
+ case T_GroupingSetData:
+ retval = _copyGroupingSetData(from);
+ break;
case T_GroupingSet:
retval = _copyGroupingSet(from);
break;
diff --git a/src/backend/nodes/equalfuncs.c b/src/backend/nodes/equalfuncs.c
index 88b9129..6aa71d3 100644
--- a/src/backend/nodes/equalfuncs.c
+++ b/src/backend/nodes/equalfuncs.c
@@ -3078,6 +3078,9 @@ equal(const void *a, const void *b)
case T_GroupingFunc:
retval = _equalGroupingFunc(a, b);
break;
+ case T_GroupingSetId:
+ retval = true;
+ break;
case T_WindowFunc:
retval = _equalWindowFunc(a, b);
break;
diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index d85ca9f..877ea0b 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -62,6 +62,9 @@ exprType(const Node *expr)
case T_GroupingFunc:
type = INT4OID;
break;
+ case T_GroupingSetId:
+ type = INT4OID;
+ break;
case T_WindowFunc:
type = ((const WindowFunc *) expr)->wintype;
break;
@@ -740,6 +743,9 @@ exprCollation(const Node *expr)
case T_GroupingFunc:
coll = InvalidOid;
break;
+ case T_GroupingSetId:
+ coll = InvalidOid;
+ break;
case T_WindowFunc:
coll = ((const WindowFunc *) expr)->wincollid;
break;
@@ -1869,6 +1875,7 @@ expression_tree_walker(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
/* primitive node types with no expression subnodes */
break;
case T_WithCheckOption:
@@ -2575,6 +2582,7 @@ expression_tree_mutator(Node *node,
case T_NextValueExpr:
case T_RangeTblRef:
case T_SortGroupClause:
+ case T_GroupingSetId:
return (Node *) copyObject(node);
case T_WithCheckOption:
{
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 5816d12..efcb1c7 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -785,8 +785,9 @@ _outAgg(StringInfo str, const Agg *node)
WRITE_LONG_FIELD(numGroups);
WRITE_UINT64_FIELD(transitionSpace);
WRITE_BITMAPSET_FIELD(aggParams);
- WRITE_NODE_FIELD(groupingSets);
+ WRITE_NODE_FIELD(rollup);
WRITE_NODE_FIELD(chain);
+ WRITE_NODE_FIELD(gsetid);
WRITE_NODE_FIELD(sortnode);
}
@@ -1151,6 +1152,13 @@ _outGroupingFunc(StringInfo str, const GroupingFunc *node)
}
static void
+_outGroupingSetId(StringInfo str,
+ const GroupingSetId *node pg_attribute_unused())
+{
+ WRITE_NODE_TYPE("GROUPINGSETID");
+}
+
+static void
_outWindowFunc(StringInfo str, const WindowFunc *node)
{
WRITE_NODE_TYPE("WINDOWFUNC");
@@ -2002,6 +2010,7 @@ _outGroupingSetData(StringInfo str, const GroupingSetData *node)
WRITE_NODE_TYPE("GSDATA");
WRITE_NODE_FIELD(set);
+ WRITE_INT_FIELD(setId);
WRITE_FLOAT_FIELD(numGroups, "%.0f");
}
@@ -3847,6 +3856,9 @@ outNode(StringInfo str, const void *obj)
case T_GroupingFunc:
_outGroupingFunc(str, obj);
break;
+ case T_GroupingSetId:
+ _outGroupingSetId(str, obj);
+ break;
case T_WindowFunc:
_outWindowFunc(str, obj);
break;
diff --git a/src/backend/nodes/readfuncs.c b/src/backend/nodes/readfuncs.c
index af4fcfe..c9a3340 100644
--- a/src/backend/nodes/readfuncs.c
+++ b/src/backend/nodes/readfuncs.c
@@ -637,6 +637,50 @@ _readGroupingFunc(void)
}
/*
+ * _readGroupingSetId
+ */
+static GroupingSetId *
+_readGroupingSetId(void)
+{
+ READ_LOCALS_NO_FIELDS(GroupingSetId);
+
+ READ_DONE();
+}
+
+/*
+ * _readRollupData
+ */
+static RollupData *
+_readRollupData(void)
+{
+ READ_LOCALS(RollupData);
+
+ READ_NODE_FIELD(groupClause);
+ READ_NODE_FIELD(gsets);
+ READ_NODE_FIELD(gsets_data);
+ READ_FLOAT_FIELD(numGroups);
+ READ_BOOL_FIELD(hashable);
+ READ_BOOL_FIELD(is_hashed);
+
+ READ_DONE();
+}
+
+/*
+ * _readGroupingSetData
+ */
+static GroupingSetData *
+_readGroupingSetData(void)
+{
+ READ_LOCALS(GroupingSetData);
+
+ READ_NODE_FIELD(set);
+ READ_INT_FIELD(setId);
+ READ_FLOAT_FIELD(numGroups);
+
+ READ_DONE();
+}
+
+/*
* _readWindowFunc
*/
static WindowFunc *
@@ -2205,8 +2249,9 @@ _readAgg(void)
READ_LONG_FIELD(numGroups);
READ_UINT64_FIELD(transitionSpace);
READ_BITMAPSET_FIELD(aggParams);
- READ_NODE_FIELD(groupingSets);
+ READ_NODE_FIELD(rollup);
READ_NODE_FIELD(chain);
+ READ_NODE_FIELD(gsetid);
READ_NODE_FIELD(sortnode);
READ_DONE();
@@ -2642,6 +2687,12 @@ parseNodeString(void)
return_value = _readAggref();
else if (MATCH("GROUPINGFUNC", 12))
return_value = _readGroupingFunc();
+ else if (MATCH("GROUPINGSETID", 13))
+ return_value = _readGroupingSetId();
+ else if (MATCH("ROLLUP", 6))
+ return_value = _readRollupData();
+ else if (MATCH("GSDATA", 6))
+ return_value = _readGroupingSetData();
else if (MATCH("WINDOWFUNC", 10))
return_value = _readWindowFunc();
else if (MATCH("SUBSCRIPTINGREF", 15))
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 905bbe7..e6c7f08 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2710,8 +2710,11 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo *rel, bool override_rows)
/*
* For each useful ordering, we can consider an order-preserving Gather
- * Merge.
+ * Merge. Don't do this for partial groupingsets.
*/
+ if (root->parse->groupingSets)
+ return;
+
foreach(lc, rel->partial_pathlist)
{
Path *subpath = (Path *) lfirst(lc);
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index e9ad5a9..db88222 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1641,7 +1641,7 @@ create_unique_plan(PlannerInfo *root, UniquePath *best_path, int flags)
groupColIdx,
groupOperators,
groupCollations,
- NIL,
+ NULL,
NIL,
best_path->path.rows,
0,
@@ -2095,7 +2095,7 @@ create_agg_plan(PlannerInfo *root, AggPath *best_path)
extract_grouping_ops(best_path->groupClause),
extract_grouping_collations(best_path->groupClause,
subplan->targetlist),
- NIL,
+ NULL,
NIL,
best_path->numGroups,
best_path->transitionSpace,
@@ -2214,7 +2214,6 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
* never be grouping in an UPDATE/DELETE; but let's Assert that.
*/
Assert(root->inhTargetKind == INHKIND_NONE);
- Assert(root->grouping_map == NULL);
root->grouping_map = grouping_map;
/*
@@ -2241,7 +2240,9 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
* node if the input is not sorted yet, for other rollups using
* sorted mode, always add an explicit sort.
*/
- if (!rollup->is_hashed)
+ /* In the final stage, a rollup may contain an empty set here */
+ if (!rollup->is_hashed &&
+ list_length(linitial(rollup->gsets)) != 0)
{
sort_plan = (Plan *)
make_sort_from_groupcols(rollup->groupClause,
@@ -2265,12 +2266,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
agg_plan = (Plan *) make_agg(NIL,
NIL,
strat,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
list_length((List *) linitial(rollup->gsets)),
new_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
NIL,
rollup->numGroups,
best_path->transitionSpace,
@@ -2282,8 +2283,7 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
}
/*
- * Now make the real Agg node
- */
+ * Now make the real Agg node */
{
RollupData *rollup = linitial(rollups);
AttrNumber *top_grpColIdx;
@@ -2315,12 +2315,12 @@ create_groupingsets_plan(PlannerInfo *root, GroupingSetsPath *best_path)
plan = make_agg(build_path_tlist(root, &best_path->path),
best_path->qual,
best_path->aggstrategy,
- AGGSPLIT_SIMPLE,
+ best_path->aggsplit,
numGroupCols,
top_grpColIdx,
extract_grouping_ops(rollup->groupClause),
extract_grouping_collations(rollup->groupClause, subplan->targetlist),
- rollup->gsets,
+ rollup,
chain,
rollup->numGroups,
best_path->transitionSpace,
@@ -6222,7 +6222,7 @@ Agg *
make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ RollupData *rollup, List *chain, double dNumGroups,
Size transitionSpace, Plan *sortnode, Plan *lefttree)
{
Agg *node = makeNode(Agg);
@@ -6241,8 +6241,9 @@ make_agg(List *tlist, List *qual,
node->numGroups = numGroups;
node->transitionSpace = transitionSpace;
node->aggParams = NULL; /* SS_finalize_plan() will fill this */
- node->groupingSets = groupingSets;
+ node->rollup = rollup;
node->chain = chain;
+ node->gsetid = NULL;
node->sortnode = sortnode;
plan->qual = qual;
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 2b2391b..8d5c41b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -113,6 +113,7 @@ typedef struct
Bitmapset *unhashable_refs;
List *unsortable_sets;
int *tleref_to_colnum_map;
+ int num_sets;
} grouping_sets_data;
/*
@@ -126,6 +127,8 @@ typedef struct
* clauses per Window */
} WindowClauseSortData;
+typedef void (*AddPathCallback) (RelOptInfo *parent_rel, Path *new_path);
+
/* Local functions */
static Node *preprocess_expression(PlannerInfo *root, Node *expr, int kind);
static void preprocess_qual_conditions(PlannerInfo *root, Node *jtnode);
@@ -142,7 +145,8 @@ static double preprocess_limit(PlannerInfo *root,
static void remove_useless_groupby_columns(PlannerInfo *root);
static List *preprocess_groupclause(PlannerInfo *root, List *force);
static List *extract_rollup_sets(List *groupingSets);
-static List *reorder_grouping_sets(List *groupingSets, List *sortclause);
+static List *reorder_grouping_sets(grouping_sets_data *gd,
+ List *groupingSets, List *sortclause);
static void standard_qp_callback(PlannerInfo *root, void *extra);
static double get_number_of_groups(PlannerInfo *root,
double path_rows,
@@ -176,7 +180,11 @@ static void consider_groupingsets_paths(PlannerInfo *root,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
double dNumGroups,
- AggStrategy strat);
+ List *havingQual,
+ AggStrategy strat,
+ AggSplit aggsplit,
+ AddPathCallback add_path_fn);
+
static RelOptInfo *create_window_paths(PlannerInfo *root,
RelOptInfo *input_rel,
PathTarget *input_target,
@@ -250,6 +258,9 @@ static bool group_by_has_partkey(RelOptInfo *input_rel,
List *groupClause);
static int common_prefix_cmp(const void *a, const void *b);
+static List *extract_final_rollups(PlannerInfo *root,
+ grouping_sets_data *gd,
+ List *rollups);
/*****************************************************************************
*
@@ -2491,6 +2502,7 @@ preprocess_grouping_sets(PlannerInfo *root)
GroupingSetData *gs = makeNode(GroupingSetData);
gs->set = gset;
+ gs->setId = gd->num_sets++;
gd->unsortable_sets = lappend(gd->unsortable_sets, gs);
/*
@@ -2535,7 +2547,7 @@ preprocess_grouping_sets(PlannerInfo *root)
* largest-member-first, and applies the GroupingSetData annotations,
* though the data will be filled in later.
*/
- current_sets = reorder_grouping_sets(current_sets,
+ current_sets = reorder_grouping_sets(gd, current_sets,
(list_length(sets) == 1
? parse->sortClause
: NIL));
@@ -3544,7 +3556,7 @@ extract_rollup_sets(List *groupingSets)
* gets implemented in one pass.)
*/
static List *
-reorder_grouping_sets(List *groupingsets, List *sortclause)
+reorder_grouping_sets(grouping_sets_data *gd, List *groupingsets, List *sortclause)
{
ListCell *lc;
List *previous = NIL;
@@ -3578,6 +3590,7 @@ reorder_grouping_sets(List *groupingsets, List *sortclause)
previous = list_concat(previous, new_elems);
gs->set = list_copy(previous);
+ gs->setId = gd->num_sets++;
result = lcons(gs, result);
}
@@ -4211,9 +4224,11 @@ consider_groupingsets_paths(PlannerInfo *root,
grouping_sets_data *gd,
const AggClauseCosts *agg_costs,
double dNumGroups,
- AggStrategy strat)
+ List *havingQual,
+ AggStrategy strat,
+ AggSplit aggsplit,
+ AddPathCallback add_path_fn)
{
- Query *parse = root->parse;
Assert(strat == AGG_HASHED || strat == AGG_SORTED);
/*
@@ -4378,16 +4393,20 @@ consider_groupingsets_paths(PlannerInfo *root,
strat = AGG_MIXED;
}
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- strat,
- new_rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ new_rollups = extract_final_rollups(root, gd, new_rollups);
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ strat,
+ new_rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
return;
}
@@ -4399,7 +4418,7 @@ consider_groupingsets_paths(PlannerInfo *root,
/*
* Callers consider AGG_SORTED strategy, the first rollup must
- * use non-hashed aggregate, 'is_sorted' tells whether the first
+ * use non-hashed aggregate, is_sorted tells whether the first
* rollup need to do its own sort.
*
* we try and make two paths: one sorted and one mixed
@@ -4544,16 +4563,20 @@ consider_groupingsets_paths(PlannerInfo *root,
if (rollups)
{
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_MIXED,
- rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rollups = extract_final_rollups(root, gd, rollups);
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_MIXED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
}
}
@@ -4561,16 +4584,82 @@ consider_groupingsets_paths(PlannerInfo *root,
* Now try the simple sorted case.
*/
if (!gd->unsortable_sets)
- add_path(grouped_rel, (Path *)
- create_groupingsets_path(root,
- grouped_rel,
- path,
- (List *) parse->havingQual,
- AGG_SORTED,
- gd->rollups,
- agg_costs,
- dNumGroups,
- is_sorted));
+ {
+ List *rollups;
+
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rollups = extract_final_rollups(root, gd, gd->rollups);
+ else
+ rollups = gd->rollups;
+
+ add_path_fn(grouped_rel, (Path *)
+ create_groupingsets_path(root,
+ grouped_rel,
+ path,
+ havingQual,
+ AGG_SORTED,
+ rollups,
+ agg_costs,
+ dNumGroups,
+ aggsplit,
+ is_sorted));
+ }
+}
+
+/*
+ * If we are combining a partial groupingsets aggregation, the input is
+ * a mix of tuples from different grouping sets; the executor dispatches
+ * the tuples to different rollups (phases) according to the grouping
+ * set ID.
+ *
+ * We cannot reuse the rollups of the initial stage, in which each tuple
+ * is processed by one or more grouping sets within one rollup, because
+ * in the combining stage each tuple belongs to exactly one grouping set.
+ * In this case, we use final rollups, each with only one grouping set.
+ */
+static List *
+extract_final_rollups(PlannerInfo *root,
+ grouping_sets_data *gd,
+ List *rollups)
+{
+ ListCell *lc;
+ List *new_rollups = NIL;
+
+ foreach(lc, rollups)
+ {
+ ListCell *lc1;
+ RollupData *rollup = lfirst_node(RollupData, lc);
+
+ foreach(lc1, rollup->gsets_data)
+ {
+ GroupingSetData *gs = lfirst_node(GroupingSetData, lc1);
+ RollupData *new_rollup = makeNode(RollupData);
+
+ if (gs->set != NIL)
+ {
+ new_rollup->groupClause = preprocess_groupclause(root, gs->set);
+ new_rollup->gsets_data = list_make1(gs);
+ new_rollup->gsets = remap_to_groupclause_idx(new_rollup->groupClause,
+ new_rollup->gsets_data,
+ gd->tleref_to_colnum_map);
+ new_rollup->hashable = rollup->hashable;
+ new_rollup->is_hashed = rollup->is_hashed;
+ }
+ else
+ {
+ new_rollup->groupClause = NIL;
+ new_rollup->gsets_data = list_make1(gs);
+ new_rollup->gsets = list_make1(NIL);
+ new_rollup->hashable = false;
+ new_rollup->is_hashed = false;
+ }
+
+ new_rollup->numGroups = gs->numGroups;
+ new_rollups = lappend(new_rollups, new_rollup);
+ }
+ }
+
+ return new_rollups;
}
/*
@@ -5281,6 +5370,17 @@ make_partial_grouping_target(PlannerInfo *root,
add_new_columns_to_pathtarget(partial_target, non_group_exprs);
/*
+ * We are generating a partial groupingsets path; add an expression
+ * exposing the grouping set ID of each tuple, so that in the final
+ * stage the executor knows which set each tuple belongs to.
+ */
+ if (parse->groupingSets)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ add_new_column_to_pathtarget(partial_target, (Expr *)expr);
+ }
+
+ /*
* Adjust Aggrefs to put them in partial mode. At this point all Aggrefs
* are at the top level of the target list, so we can just scan the list
* rather than recursing through the expression trees.
@@ -6455,7 +6555,10 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
consider_groupingsets_paths(root, grouped_rel,
path, is_sorted, can_hash,
gd, agg_costs, dNumGroups,
- AGG_SORTED);
+ havingQual,
+ AGG_SORTED,
+ AGGSPLIT_SIMPLE,
+ add_path);
continue;
}
@@ -6516,15 +6619,37 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
foreach(lc, partially_grouped_rel->pathlist)
{
Path *path = (Path *) lfirst(lc);
+ bool is_sorted;
+
+ is_sorted = pathkeys_contained_in(root->group_pathkeys,
+ path->pathkeys);
+
+ /*
+ * Use any available suitably-sorted path as input, and also
+ * consider sorting the cheapest-total path.
+ */
+ if (path != partially_grouped_rel->cheapest_total_path &&
+ !is_sorted)
+ continue;
+
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGG_SORTED,
+ AGGSPLIT_FINAL_DESERIAL,
+ add_path);
+ continue;
+ }
/*
* Insert a Sort node, if required. But there's no point in
* sorting anything but the cheapest path.
*/
- if (!pathkeys_contained_in(root->group_pathkeys, path->pathkeys))
+ if (!is_sorted)
{
- if (path != partially_grouped_rel->cheapest_total_path)
- continue;
path = (Path *) create_sort_path(root,
grouped_rel,
path,
@@ -6568,7 +6693,10 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
consider_groupingsets_paths(root, grouped_rel,
cheapest_path, false, true,
gd, agg_costs, dNumGroups,
- AGG_HASHED);
+ havingQual,
+ AGG_HASHED,
+ AGGSPLIT_SIMPLE,
+ add_path);
}
else
{
@@ -6612,23 +6740,39 @@ add_paths_to_grouping_rel(PlannerInfo *root, RelOptInfo *input_rel,
{
Path *path = partially_grouped_rel->cheapest_total_path;
- hashaggtablesize = estimate_hashagg_tablesize(path,
- agg_final_costs,
- dNumGroups);
+ if (parse->groupingSets)
+ {
+ /*
+ * Try for a hash-only groupingsets path over unsorted input.
+ */
+ consider_groupingsets_paths(root, grouped_rel,
+ path, false, true,
+ gd, agg_final_costs, dNumGroups,
+ havingQual,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ add_path);
+ }
+ else
+ {
+ hashaggtablesize = estimate_hashagg_tablesize(path,
+ agg_final_costs,
+ dNumGroups);
- if (enable_hashagg_disk ||
- hashaggtablesize < work_mem * 1024L)
- add_path(grouped_rel, (Path *)
- create_agg_path(root,
- grouped_rel,
- path,
- grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_FINAL_DESERIAL,
- parse->groupClause,
- havingQual,
- agg_final_costs,
- dNumGroups));
+ if (enable_hashagg_disk ||
+ hashaggtablesize < work_mem * 1024L)
+ add_path(grouped_rel, (Path *)
+ create_agg_path(root,
+ grouped_rel,
+ path,
+ grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_FINAL_DESERIAL,
+ parse->groupClause,
+ havingQual,
+ agg_final_costs,
+ dNumGroups));
+ }
}
}
@@ -6838,6 +6982,19 @@ create_partial_grouping_paths(PlannerInfo *root,
path->pathkeys);
if (path == cheapest_partial_path || is_sorted)
{
+ if (parse->groupingSets)
+ {
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ path, is_sorted, can_hash,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGG_SORTED,
+ AGGSPLIT_INITIAL_SERIAL,
+ add_partial_path);
+ continue;
+ }
+
/* Sort the cheapest partial path, if it isn't already */
if (!is_sorted)
path = (Path *) create_sort_path(root,
@@ -6907,26 +7064,41 @@ create_partial_grouping_paths(PlannerInfo *root,
{
double hashaggtablesize;
- hashaggtablesize =
- estimate_hashagg_tablesize(cheapest_partial_path,
- agg_partial_costs,
- dNumPartialPartialGroups);
-
- /* Do the same for partial paths. */
- if ((enable_hashagg_disk || hashaggtablesize < work_mem * 1024L) &&
- cheapest_partial_path != NULL)
+ if (parse->groupingSets)
{
- add_partial_path(partially_grouped_rel, (Path *)
- create_agg_path(root,
- partially_grouped_rel,
- cheapest_partial_path,
- partially_grouped_rel->reltarget,
- AGG_HASHED,
- AGGSPLIT_INITIAL_SERIAL,
- parse->groupClause,
- NIL,
- agg_partial_costs,
- dNumPartialPartialGroups));
+ consider_groupingsets_paths(root, partially_grouped_rel,
+ cheapest_partial_path,
+ false, true,
+ gd, agg_partial_costs,
+ dNumPartialPartialGroups,
+ NIL,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ add_partial_path);
+ }
+ else
+ {
+ hashaggtablesize =
+ estimate_hashagg_tablesize(cheapest_partial_path,
+ agg_partial_costs,
+ dNumPartialPartialGroups);
+
+ /* Do the same for partial paths. */
+ if ((enable_hashagg_disk || hashaggtablesize < work_mem * 1024L) &&
+ cheapest_partial_path != NULL)
+ {
+ add_partial_path(partially_grouped_rel, (Path *)
+ create_agg_path(root,
+ partially_grouped_rel,
+ cheapest_partial_path,
+ partially_grouped_rel->reltarget,
+ AGG_HASHED,
+ AGGSPLIT_INITIAL_SERIAL,
+ parse->groupClause,
+ NIL,
+ agg_partial_costs,
+ dNumPartialPartialGroups));
+ }
}
}
@@ -6970,6 +7142,9 @@ gather_grouping_paths(PlannerInfo *root, RelOptInfo *rel)
generate_gather_paths(root, rel, true);
/* Try cheapest partial path + explicit Sort + Gather Merge. */
+ if (root->parse->groupingSets)
+ return;
+
cheapest_partial_path = linitial(rel->partial_pathlist);
if (!pathkeys_contained_in(root->group_pathkeys,
cheapest_partial_path->pathkeys))
@@ -7014,11 +7189,6 @@ can_partial_agg(PlannerInfo *root, const AggClauseCosts *agg_costs)
*/
return false;
}
- else if (parse->groupingSets)
- {
- /* We don't know how to do grouping sets in parallel. */
- return false;
- }
else if (agg_costs->hasNonPartial || agg_costs->hasNonSerial)
{
/* Insufficient support for partial mode. */
diff --git a/src/backend/optimizer/plan/setrefs.c b/src/backend/optimizer/plan/setrefs.c
index 3dcded5..eae7d15 100644
--- a/src/backend/optimizer/plan/setrefs.c
+++ b/src/backend/optimizer/plan/setrefs.c
@@ -754,6 +754,22 @@ set_plan_refs(PlannerInfo *root, Plan *plan, int rtoffset)
plan->qual = (List *)
convert_combining_aggrefs((Node *) plan->qual,
NULL);
+
+ /*
+ * For grouping sets, we must add an expression that evaluates
+ * the grouping set ID and fix its reference against the
+ * targetlist of the child plan node.
+ */
+ if (agg->rollup)
+ {
+ GroupingSetId *expr = makeNode(GroupingSetId);
+ indexed_tlist *subplan_itlist = build_tlist_index(plan->lefttree->targetlist);
+
+ agg->gsetid = (Expr *) fix_upper_expr(root, (Node *)expr,
+ subplan_itlist,
+ OUTER_VAR,
+ rtoffset);
+ }
}
set_upper_references(root, plan, rtoffset);
diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index 4578c31..f0f7cd5 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -2995,6 +2995,7 @@ create_groupingsets_path(PlannerInfo *root,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups,
+ AggSplit aggsplit,
bool is_sorted)
{
GroupingSetsPath *pathnode = makeNode(GroupingSetsPath);
@@ -3012,6 +3013,7 @@ create_groupingsets_path(PlannerInfo *root,
subpath->parallel_safe;
pathnode->path.parallel_workers = subpath->parallel_workers;
pathnode->subpath = subpath;
+ pathnode->aggsplit = aggsplit;
pathnode->is_sorted = is_sorted;
/*
@@ -3046,11 +3048,27 @@ create_groupingsets_path(PlannerInfo *root,
Assert(aggstrategy != AGG_PLAIN || list_length(rollups) == 1);
Assert(aggstrategy != AGG_MIXED || list_length(rollups) > 1);
+ /*
+ * Estimate the cost of groupingsets.
+ *
+ * If we are finalizing grouping sets, subpath->rows counts
+ * rows from all sets, so we must estimate the number of rows
+ * belonging to each rollup. The cost of preprocessing the
+ * grouping sets is not estimated separately: the expression
+ * that redirects tuples is a simple Var, which normally
+ * costs zero to evaluate.
+ */
foreach(lc, rollups)
{
RollupData *rollup = lfirst(lc);
List *gsets = rollup->gsets;
int numGroupCols = list_length(linitial(gsets));
+ double rows = 0.0;
+
+ if (DO_AGGSPLIT_COMBINE(aggsplit))
+ rows = rollup->numGroups * subpath->rows / numGroups;
+ else
+ rows = subpath->rows;
/*
* In AGG_SORTED or AGG_PLAIN mode, the first rollup do its own
@@ -3072,7 +3090,7 @@ create_groupingsets_path(PlannerInfo *root,
cost_sort(&sort_path, root, NIL,
input_total_cost,
- subpath->rows,
+ rows,
subpath->pathtarget->width,
0.0,
work_mem,
@@ -3090,7 +3108,7 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
input_startup_cost,
input_total_cost,
- subpath->rows,
+ rows,
subpath->pathtarget->width);
is_first = false;
}
@@ -3102,7 +3120,6 @@ create_groupingsets_path(PlannerInfo *root,
sort_path.startup_cost = 0;
sort_path.total_cost = 0;
- sort_path.rows = subpath->rows;
rollup_strategy = rollup->is_hashed ?
AGG_HASHED : (numGroupCols ? AGG_SORTED : AGG_PLAIN);
@@ -3112,7 +3129,7 @@ create_groupingsets_path(PlannerInfo *root,
/* Account for cost of sort, but don't charge input cost again */
cost_sort(&sort_path, root, NIL,
0.0,
- subpath->rows,
+ rows,
subpath->pathtarget->width,
0.0,
work_mem,
@@ -3128,7 +3145,7 @@ create_groupingsets_path(PlannerInfo *root,
having_qual,
sort_path.startup_cost,
sort_path.total_cost,
- sort_path.rows,
+ rows,
subpath->pathtarget->width);
pathnode->path.total_cost += agg_path.total_cost;
diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index 5e63238..5779d15 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -7941,6 +7941,12 @@ get_rule_expr(Node *node, deparse_context *context,
}
break;
+ case T_GroupingSetId:
+ {
+ appendStringInfoString(buf, "GROUPINGSETID()");
+ }
+ break;
+
case T_WindowFunc:
get_windowfunc_expr((WindowFunc *) node, context);
break;
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 4ed5d0a..4d36c2d 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -216,6 +216,7 @@ typedef enum ExprEvalOp
EEOP_XMLEXPR,
EEOP_AGGREF,
EEOP_GROUPING_FUNC,
+ EEOP_GROUPING_SET_ID,
EEOP_WINDOW_FUNC,
EEOP_SUBPLAN,
EEOP_ALTERNATIVE_SUBPLAN,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 1612b71..67b728a 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -285,6 +285,8 @@ typedef struct AggStatePerPhaseData
AggStatePerGroup *pergroups; /* pergroup states for a phase */
bool skip_evaltrans; /* do not build evaltrans */
+#define FIELDNO_AGGSTATEPERPHASE_SETNOGSETIDS 12
+ int *setno_gsetids; /* setno <-> gsetid map */
} AggStatePerPhaseData;
typedef struct AggStatePerPhaseSortData
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 688f0c7..20378eb 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2047,6 +2047,7 @@ typedef struct AggState
int numtrans; /* number of pertrans items */
AggStrategy aggstrategy; /* strategy mode */
AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
+#define FIELDNO_AGGSTATE_PHASE 6
AggStatePerPhase phase; /* pointer to current phase data */
int numphases; /* number of phases (including phase 0) */
int current_phase; /* current phase number */
@@ -2070,8 +2071,6 @@ typedef struct AggState
/* These fields are for grouping set phase data */
int maxsets; /* The max number of sets in any phase */
AggStatePerPhase *phases; /* array of all phases */
- Tuplesortstate *sort_in; /* sorted input to phases > 1 */
- Tuplesortstate *sort_out; /* input is copied here for next phase */
TupleTableSlot *sort_slot; /* slot for sort results */
/* these fields are used in AGG_PLAIN and AGG_SORTED modes: */
HeapTuple grp_firstTuple; /* copy of first tuple of current group */
@@ -2103,6 +2102,11 @@ typedef struct AggState
ProjectionInfo *combinedproj; /* projection machinery */
+
+ /* these fields are used in parallel grouping sets */
+ bool groupingsets_preprocess; /* groupingsets preprocessed yet? */
+ ExprState *gsetid; /* expression state to get grpsetid from input */
+ int *gsetid_phaseidxs; /* grpsetid <-> phaseidx mapping */
} AggState;
/* ----------------
diff --git a/src/include/nodes/nodes.h b/src/include/nodes/nodes.h
index 8a76afe..a48a7af 100644
--- a/src/include/nodes/nodes.h
+++ b/src/include/nodes/nodes.h
@@ -153,6 +153,7 @@ typedef enum NodeTag
T_Param,
T_Aggref,
T_GroupingFunc,
+ T_GroupingSetId,
T_WindowFunc,
T_SubscriptingRef,
T_FuncExpr,
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index c1e69c8..2761fa6 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -1676,6 +1676,7 @@ typedef struct GroupingSetData
{
NodeTag type;
List *set; /* grouping set as list of sortgrouprefs */
+ int setId; /* unique grouping set identifier */
double numGroups; /* est. number of result groups */
} GroupingSetData;
@@ -1702,6 +1703,7 @@ typedef struct GroupingSetsPath
List *rollups; /* list of RollupData */
List *qual; /* quals (HAVING quals), if any */
uint64 transitionSpace; /* for pass-by-ref transition data */
+ AggSplit aggsplit; /* agg-splitting mode, see nodes.h */
bool is_sorted; /* input sorted in groupcols of first rollup */
} GroupingSetsPath;
diff --git a/src/include/nodes/plannodes.h b/src/include/nodes/plannodes.h
index 3cd2537..5b1239a 100644
--- a/src/include/nodes/plannodes.h
+++ b/src/include/nodes/plannodes.h
@@ -20,6 +20,7 @@
#include "nodes/bitmapset.h"
#include "nodes/lockoptions.h"
#include "nodes/primnodes.h"
+#include "nodes/pathnodes.h"
/* ----------------------------------------------------------------
@@ -816,8 +817,9 @@ typedef struct Agg
uint64 transitionSpace; /* for pass-by-ref transition data */
Bitmapset *aggParams; /* IDs of Params used in Aggref inputs */
/* Note: planner provides numGroups & aggParams only in HASHED/MIXED case */
- List *groupingSets; /* grouping sets to use */
+ RollupData *rollup; /* grouping sets to use */
List *chain; /* chained Agg/Sort nodes */
+ Expr *gsetid; /* expression to fetch grouping set id */
Plan *sortnode; /* agg does its own sort, only used by grouping sets now */
} Agg;
diff --git a/src/include/nodes/primnodes.h b/src/include/nodes/primnodes.h
index d73be2a..f8f85d4 100644
--- a/src/include/nodes/primnodes.h
+++ b/src/include/nodes/primnodes.h
@@ -364,6 +364,12 @@ typedef struct GroupingFunc
int location; /* token location */
} GroupingFunc;
+/* GroupingSetId */
+typedef struct GroupingSetId
+{
+ Expr xpr;
+} GroupingSetId;
+
/*
* WindowFunc
*/
diff --git a/src/include/optimizer/pathnode.h b/src/include/optimizer/pathnode.h
index f9f388b..4fde8b2 100644
--- a/src/include/optimizer/pathnode.h
+++ b/src/include/optimizer/pathnode.h
@@ -218,6 +218,7 @@ extern GroupingSetsPath *create_groupingsets_path(PlannerInfo *root,
List *rollups,
const AggClauseCosts *agg_costs,
double numGroups,
+ AggSplit aggsplit,
bool is_sorted);
extern MinMaxAggPath *create_minmaxagg_path(PlannerInfo *root,
RelOptInfo *rel,
diff --git a/src/include/optimizer/planmain.h b/src/include/optimizer/planmain.h
index 5954ff3..e987011 100644
--- a/src/include/optimizer/planmain.h
+++ b/src/include/optimizer/planmain.h
@@ -54,7 +54,7 @@ extern Sort *make_sort_from_sortclauses(List *sortcls, Plan *lefttree);
extern Agg *make_agg(List *tlist, List *qual,
AggStrategy aggstrategy, AggSplit aggsplit,
int numGroupCols, AttrNumber *grpColIdx, Oid *grpOperators, Oid *grpCollations,
- List *groupingSets, List *chain, double dNumGroups,
+ RollupData *rollup, List *chain, double dNumGroups,
Size transitionSpace, Plan *sortnode, Plan *lefttree);
extern Limit *make_limit(Plan *lefttree, Node *limitOffset, Node *limitCount);
diff --git a/src/test/regress/expected/groupingsets.out b/src/test/regress/expected/groupingsets.out
index b29917c..62bf69f 100644
--- a/src/test/regress/expected/groupingsets.out
+++ b/src/test/regress/expected/groupingsets.out
@@ -1700,4 +1700,116 @@ set work_mem to default;
drop table gs_group_1;
drop table gs_hash_1;
SET enable_groupingsets_hash_disk TO DEFAULT;
+--
+-- Compare results between parallel plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+create table gstest_p as select g%100 as g100, g%10 as g10, g
+from generate_series(0,199999) g;
+ANALYZE gstest_p;
+-- Prepared sort agg without parallelism
+set enable_hashagg = off;
+set min_parallel_table_scan_size = '128MB';
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+ QUERY PLAN
+----------------------------
+ GroupAggregate
+ Sort Key: g100, g10
+ Group Key: g100, g10
+ Group Key: g100
+ Group Key: ()
+ Sort Key: g10
+ Group Key: g10
+ -> Seq Scan on gstest_p
+(8 rows)
+
+create table p_gs_group_1 as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+-- Prepare sort agg with parallelism
+set min_parallel_table_scan_size = '4kB';
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+ QUERY PLAN
+-------------------------------------------------
+ Finalize GroupAggregate
+ Filtered by: (GROUPINGSETID())
+ Sort Key: g100, g10
+ Group Key: g100, g10
+ Sort Key: g100
+ Group Key: g100
+ Group Key: ()
+ Sort Key: g10
+ Group Key: g10
+ -> Gather
+ Workers Planned: 2
+ -> Partial GroupAggregate
+ Sort Key: g100, g10
+ Group Key: g100, g10
+ Group Key: g100
+ Group Key: ()
+ Sort Key: g10
+ Group Key: g10
+ -> Parallel Seq Scan on gstest_p
+(19 rows)
+
+create table p_gs_group_1_p as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+-- Prepare hash agg with parallelism
+SET enable_groupingsets_hash_disk = true;
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+ QUERY PLAN
+-------------------------------------------------
+ Finalize MixedAggregate
+ Filtered by: (GROUPINGSETID())
+ Group Key: ()
+ Hash Key: g100, g10
+ Hash Key: g100
+ Hash Key: g10
+ -> Gather
+ Workers Planned: 2
+ -> Partial MixedAggregate
+ Group Key: ()
+ Hash Key: g100, g10
+ Hash Key: g100
+ Hash Key: g10
+ -> Parallel Seq Scan on gstest_p
+(14 rows)
+
+create table p_gs_hash_1_p as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+RESET enable_sort;
+RESET work_mem;
+RESET enable_groupingsets_hash_disk;
+RESET min_parallel_table_scan_size;
+-- Compare results
+(select * from p_gs_group_1 except select * from p_gs_group_1_p)
+ union all
+(select * from p_gs_group_1_p except select * from p_gs_group_1);
+ g100 | g10 | sum | count | max
+------+-----+-----+-------+-----
+(0 rows)
+
+(select * from p_gs_group_1 except select * from p_gs_hash_1_p)
+ union all
+(select * from p_gs_hash_1_p except select * from p_gs_group_1);
+ g100 | g10 | sum | count | max
+------+-----+-----+-------+-----
+(0 rows)
+
+drop table gstest_p;
+drop table p_gs_group_1;
+drop table p_gs_group_1_p;
+drop table p_gs_hash_1_p;
-- end
diff --git a/src/test/regress/sql/groupingsets.sql b/src/test/regress/sql/groupingsets.sql
index 77e1967..1b23baf 100644
--- a/src/test/regress/sql/groupingsets.sql
+++ b/src/test/regress/sql/groupingsets.sql
@@ -499,4 +499,68 @@ drop table gs_hash_1;
SET enable_groupingsets_hash_disk TO DEFAULT;
+--
+-- Compare results between parallel plans using sorting and plans using hash
+-- aggregation. Force spilling in both cases by setting work_mem low
+-- and turning on enable_groupingsets_hash_disk.
+--
+create table gstest_p as select g%100 as g100, g%10 as g10, g
+from generate_series(0,199999) g;
+ANALYZE gstest_p;
+
+-- Prepared sort agg without parallelism
+set enable_hashagg = off;
+set min_parallel_table_scan_size = '128MB';
+
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+create table p_gs_group_1 as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+-- Prepare sort agg with parallelism
+set min_parallel_table_scan_size = '4kB';
+
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+create table p_gs_group_1_p as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+-- Prepare hash agg with parallelism
+SET enable_groupingsets_hash_disk = true;
+set enable_hashagg = true;
+set enable_sort = false;
+set work_mem='64kB';
+
+explain (costs off)
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+create table p_gs_hash_1_p as
+select g100, g10, sum(g::numeric), count(*), max(g::text) from
+gstest_p group by cube (g100,g10);
+
+RESET enable_sort;
+RESET work_mem;
+RESET enable_groupingsets_hash_disk;
+RESET min_parallel_table_scan_size;
+
+-- Compare results
+(select * from p_gs_group_1 except select * from p_gs_group_1_p)
+ union all
+(select * from p_gs_group_1_p except select * from p_gs_group_1);
+
+(select * from p_gs_group_1 except select * from p_gs_hash_1_p)
+ union all
+(select * from p_gs_hash_1_p except select * from p_gs_group_1);
+
+drop table gstest_p;
+drop table p_gs_group_1;
+drop table p_gs_group_1_p;
+drop table p_gs_hash_1_p;
-- end
--
1.8.3.1
On 25 Mar 2020, at 15:35, Pengzhou Tang <ptang@pivotal.io> wrote:
Thanks a lot. The patch had a memory leak in lookup_hash_entries: it used a list_concat there,
causing a 64-byte leak for every tuple; that has been fixed. Also resolved conflicts and rebased the code.
While there hasn't been a review of this version, it no longer applies to HEAD.
There was also considerable discussion in a (virtual) hallway-track session
during PGCon which reviewed the approach (for lack of a better description),
deeming that nodeAgg.c needs a refactoring before complicating it further.
Based on that, and an off-list discussion with Melanie who had picked up the
patch, I'm marking this Returned with Feedback.
cheers ./daniel